1. Move cond to torch/_higher_order_ops
2. Fix a bug in map, which didn't respect tensor dtype when creating new tensors from existing ones. We cannot directly use empty_strided because a boolean tensor created by empty_strided is not properly initialized, which causes the error "load of value 190, which is not a valid value for type 'bool'" in the clang ASAN environment on CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108025
Approved by: https://github.com/zou3519
Fixes https://github.com/pytorch/pytorch/pull/102577#issuecomment-1650905536
Serializing to JSON is more stable, so the API has been renamed:
```
# Takes in a treespec and returns the serialized treespec as a string. Also optionally takes in a protocol version number.
def treespec_dumps(treespec: TreeSpec, protocol: Optional[int] = None) -> str:
# Takes in a serialized treespec and outputs a TreeSpec
def treespec_loads(data: str) -> TreeSpec:
```
If users want to register their own serialization format for a given pytree, they can go through the `_register_treespec_serializer` API which optionally takes in a `getstate` and `setstate` function.
```
_register_treespec_serializer(type_, *, getstate, setstate)
# Takes in the context, and outputs a json-dumpable context
def getstate(context: Context) -> DumpableContext:
# Takes in a json-dumpable context, and reconstructs the original context
def setstate(dumpable_context: DumpableContext) -> Context:
```
We will serialize to the following dataclass, and then json.dump it to a string.
```
class TreeSpec:
    type: Optional[str]  # a string name of the type; null for a LeafSpec
    context: Optional[Any]  # optional, a json-dumpable form of the context
    children_specs: List[TreeSpec]
```
If no getstate/setstate function is registered, we will by default serialize the context using `json.dumps/loads`. We will also serialize the type through `f"{typ.__module__}.{typ.__name__}"`.
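For example, a minimal round-trip using the new string-based API might look like this (assuming the functions are exposed from `torch.utils._pytree`; adjust the import if they live elsewhere):
```python
import torch.utils._pytree as pytree

# Flatten a nested container, serialize its treespec to a JSON string,
# and check that deserialization round-trips.
_, spec = pytree.tree_flatten({"a": [1, 2], "b": 3})
serialized = pytree.treespec_dumps(spec)   # JSON string
assert pytree.treespec_loads(serialized) == spec
```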
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106116
Approved by: https://github.com/zou3519
**Summary**
Add linear and linear-unary post-op quantization recipes to the x86 inductor quantizer for PT2E with Inductor. With this, the quantization path will add a `quant-dequant` pattern for linear and linear-unary post ops.
**Test plan**
python test/test_quantization.py -k test_linear_with_quantizer_api
python test/test_quantization.py -k test_linear_unary_with_quantizer_api
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106781
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jerryzh168
ghstack dependencies: #105818
When generating a wrapper call, we may have implicit resize applied to
the kernel's output. For example, for addmm(3d_tensor, 2d_tensor),
its output buffer is resized to a 2d tensor. This triggers a warning from
Aten's resize_output op:
"UserWarning: An output with one or more elements was resized since it had...
This behavior is deprecated, and in a future PyTorch release outputs will
not be resized unless they have zero elements..."
More importantly, the output shape is not what we would expect, i.e.
a 2d tensor vs. the expected 3d tensor.
This PR fixes the issue by injecting resize_(0) before calling the relevant
kernel and resize_(expected_shape) after the kernel call.
It also fixes a minor typo.
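As a rough illustration of the injected pattern (plain Python pseudocode with placeholder names, not the actual codegen):
```python
# `kernel`, `args`, `out_buf`, and `expected_shape` are placeholders.
def call_kernel_with_expected_shape(kernel, args, out_buf, expected_shape):
    out_buf.resize_(0)               # avoid the resize_output warning on implicit resize
    kernel(*args, out=out_buf)       # the kernel may implicitly resize the output
    out_buf.resize_(expected_shape)  # restore the shape the rest of the graph expects
    return out_buf
```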
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107848
Approved by: https://github.com/desertfire, https://github.com/jansel
This PR brings in a few inductor changes required for ROCm
~**1 - Introduction of a toggle for enforced channel last convolution fallbacks**~
This addition is split off into its own PR after some cleanup by @pragupta https://github.com/pytorch/pytorch/pull/107812
**2 - Addition of ROCm specific block sizes**
We are now able to support the MAX_AUTOTUNE mode on ROCm, so we are proposing conditions that allow us to fine-tune our own block sizes. Currently Triton on ROCm does not benefit from pipelining, so we set all configs to `num_stages=1`, and we have removed some upstream tunings on ROCm to avoid running out of shared memory.
In the future we will provide more optimised tunings for ROCm, but for now this should mitigate any issues.
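Conceptually, the ROCm-specific adjustment looks something like the sketch below (a hypothetical helper, not the actual inductor heuristics code):
```python
import torch

def adjust_configs_for_rocm(configs):
    # On ROCm, Triton currently gains nothing from software pipelining,
    # so pin every autotune config to a single stage.
    if torch.version.hip is not None:
        for cfg in configs:
            cfg.num_stages = 1
    return configs
```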
~**3 - Addition of device_type to triton's compile_meta**~
~Proposing this addition to `triton_heuristics.py`, Triton on ROCm requires device_type to be set to hip https://github.com/ROCmSoftwarePlatform/triton/pull/284 suggesting to bring this change in here so we can pass down the correct device type to triton.~
This change is split off and will arrive in the wheel update PR https://github.com/pytorch/pytorch/pull/107600 leaving this PR to focus on the ROCm specific block sizes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107584
Approved by: https://github.com/jithunnair-amd, https://github.com/jansel, https://github.com/eellison
This reworks the DORT backend factory function to support the options kwarg of torch.compile, and defines a concrete OrtBackendOptions type that can be used to influence the backend.
Caching is also implemented in order to reuse backends with equal options.
Wrapping the backend in auto_autograd also becomes an option, which allows `OrtBackend` to always be returned as the callable for torch.compile; wrapping happens internally if opted into (True by default).
Lastly, it exposes options for configuring preferred execution providers (attempted first), whether or not to infer an ORT EP from a torch device found in the graph or inputs, and finally the default/fallback EPs.
### Demo
The following demo runs `Gelu` through `torch.compile(backend="onnxrt")` using various backend options through a dictionary form and a strongly typed form. It additionally exports the model through both the ONNX TorchScript exporter and the new TorchDynamo exporter.
```python
import math
import onnx.inliner
import onnxruntime
import torch
import torch.onnx
torch.manual_seed(0)
class Gelu(torch.nn.Module):
    def forward(self, x):
        return x * (0.5 * torch.erf(math.sqrt(0.5) * x) + 1.0)

@torch.compile(
    backend="onnxrt",
    options={
        "preferred_execution_providers": [
            "NotARealEP",
            "CPUExecutionProvider",
        ],
        "export_options": torch.onnx.ExportOptions(dynamic_shapes=True),
    },
)
def dort_gelu(x):
    return Gelu()(x)

ort_session_options = onnxruntime.SessionOptions()
ort_session_options.log_severity_level = 0

dort_gelu2 = torch.compile(
    Gelu(),
    backend="onnxrt",
    options=torch.onnx._OrtBackendOptions(
        preferred_execution_providers=[
            "NotARealEP",
            "CPUExecutionProvider",
        ],
        export_options=torch.onnx.ExportOptions(dynamic_shapes=True),
        ort_session_options=ort_session_options,
    ),
)

x = torch.randn(10)

torch.onnx.export(Gelu(), (x,), "gelu_ts.onnx")

export_output = torch.onnx.dynamo_export(Gelu(), x)
export_output.save("gelu_dynamo.onnx")
inlined_model = onnx.inliner.inline_local_functions(export_output.model_proto)
onnx.save_model(inlined_model, "gelu_dynamo_inlined.onnx")

print("Torch Eager:")
print(Gelu()(x))
print("DORT:")
print(dort_gelu(x))
print(dort_gelu2(x))
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107973
Approved by: https://github.com/BowenBao
**Summary**
The latest check-in a0cfaf0688 for conv-bn folding assumes the graph is captured by the new graph capture API `torch._export.capture_pre_autograd_graph`. Since we still need to use the original graph capture API `torch._dynamo_export` in the 2.1 release, that check-in heavily hurt the performance of some workloads. This PR fixes the issue by making the conv-bn folding function work with both the new and the original graph capture API.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107951
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
ghstack dependencies: #106836, #106838, #106958
Compared to #104848, this PR goes a step further: when the enable_sparse_support decorator is applied to `torch.autograd.gradcheck`, the resulting callable is equivalent to `torch.autograd.gradcheck` with the extra feature of supporting functions that can take sparse tensors as input and/or return sparse tensors.
At the same time, the underlying call to `torch.autograd.gradcheck` will operate on strided tensors only. This basically means that torch/autograd/gradcheck.py can be cleaned up by removing the code that deals with sparse tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107150
Approved by: https://github.com/albanD, https://github.com/amjames, https://github.com/cpuhrsch
ghstack dependencies: #107638, #107777
Resolves https://github.com/pytorch/pytorch/issues/107097
After this PR, instead of
```python
torch.sparse_coo_tensor(indices, values, size)._coalesced_(is_coalesced)
```
(that does not work in the autograd context, see #107097), use
```python
torch.sparse_coo_tensor(indices, values, size, is_coalesced=is_coalesced)
```
All sparse coo factory functions that take indices as input support the `is_coalesced` argument.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107638
Approved by: https://github.com/cpuhrsch
This PR relands https://github.com/pytorch/pytorch/pull/106827, which was reverted because it caused a compilation error for some ads models.
Yanbo provided a repro in one of the 14k models (`pytest ./generated/test_KaiyangZhou_deep_person_reid.py -k test_044`). This is also the model I used to confirm the fix and come up with a unit test. In this model, we call `triton_heuristics.triton_config` with size_hints [2048, 2]. Previously this would result in a triton config with XBLOCK=2048 and YBLOCK=2. But since we changed the mapping between size_hints and the XYZ dimensions, we now generate a triton config with XBLOCK=2 and YBLOCK=2048. This fails compilation since we set the max YBLOCK to 1024.
My fix is to make sure we never generate a triton config that exceeds the maximum block size.
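Conceptually the clamp looks like this (hypothetical names and limits, except the YBLOCK maximum of 1024 mentioned above):
```python
MAX_BLOCK = {"X": 2048, "Y": 1024, "Z": 1024}

def clamped_block(prefix: str, requested: int) -> int:
    # Never let a derived block size exceed the per-dimension maximum.
    return min(requested, MAX_BLOCK[prefix])

assert clamped_block("Y", 2048) == 1024   # the failing case from the repro
```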
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107902
Approved by: https://github.com/jansel
Summary: This fixes the no bias case for conv annotations.
Previously this would result in an index out of bounds, since
the new aten.conv2d op may not have the bias arg (unlike the
old aten.convolution op). This was not caught because of a lack
of test cases, which are added in this commit.
Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_qat_conv_no_bias
python test/test_quantization.py TestQuantizePT2E.test_qat_conv_bn_relu_fusion_no_conv_bias
Reviewers: jerryzh168, kimishpatel
Subscribers: jerryzh168, kimishpatel
Differential Revision: [D48696874](https://our.internmc.facebook.com/intern/diff/D48696874)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107971
Approved by: https://github.com/jerryzh168
Given standalone generates args anyways, it seems like it would be more convenient if it explicitly used a random port by default instead of trying to use 29400.
That way users can directly go with `--standalone` instead of having to spell out `--rdzv-backend=c10d --rdzv-endpoint=localhost:0`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107734
Approved by: https://github.com/H-Huang
For the max_pooling code:
```
#pragma GCC ivdep
for(long i2=static_cast<long>(0L); i2<static_cast<long>(56L); i2+=static_cast<long>(1L))
{
    for(long i3=static_cast<long>(0L); i3<static_cast<long>(64L); i3+=static_cast<long>(16L))
    {
        auto tmp0 = at::vec::Vectorized<int>(static_cast<int>((-1L) + (2L*i1)));
        auto tmp1 = at::vec::Vectorized<int>(static_cast<int>(0));
        auto tmp2 = to_float_mask(tmp0 >= tmp1);
        auto tmp3 = at::vec::Vectorized<int>(static_cast<int>(112));
        auto tmp4 = to_float_mask(tmp0 < tmp3);
        auto tmp5 = tmp2 & tmp4;
        auto tmp6 = at::vec::Vectorized<int>(static_cast<int>((-1L) + (2L*i2)));
        auto tmp7 = to_float_mask(tmp6 >= tmp1);
        auto tmp8 = to_float_mask(tmp6 < tmp3);
        auto tmp9 = tmp7 & tmp8;
        auto tmp10 = tmp5 & tmp9;
        auto tmp11 = [&]
        {
            // load
            auto tmp12 = at::vec::Vectorized<bfloat16>::loadu(in_ptr0 + static_cast<long>((-7232L) + i3 + (128L*i2) + (14336L*i1) + (802816L*i0)), 16);
            auto tmp13 = cvt_lowp_fp_to_fp32<bfloat16>(tmp12);
            return tmp13;
        };
        auto tmp14 = decltype(tmp11())::blendv(at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity()), tmp11(), to_float_mask(tmp10));
```
the index of ```tmp12``` may not be a valid index: for example, with ```i1=0, i2=0, i3=0``` the index is ```-7232L```, which is invalid. We may hit a segmentation fault when we call ```tmp11()```. The intended behavior is to load the value only when ```tmp10``` (the index-check variable) is true, so that it can be read safely. This PR adds masked_load support to fix the issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107670
Approved by: https://github.com/jgong5, https://github.com/jansel
In almost all cases `<iostream>` is only included for writing the output formatter, which
only uses `std::ostream`, so including `<ostream>` is sufficient.
The `<istream>` header is ~1000 lines, so the difference is non-trivial.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106914
Approved by: https://github.com/lezcano
**Summary**
Enable the `dequant pattern` promotion pass in inductor. In the qconv weight prepack pass we match the `dequant->conv2d` pattern, but if the `dequant` node has multiple users, the pattern fails to match.
Taking the example of
```
conv1
/ \
conv2 conv3
```
After quantization flow, it will generate pattern as
```
dequant1
|
conv1
|
quant2
|
dequant2
/ \
conv2 conv3
```
We need to duplicate `dequant2` into `dequant2` and `dequant3`, in order to make the `dequant2->conv2` and `dequant3->conv3` patterns match.
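A conceptual sketch of that duplication on an FX graph (illustrative only, not the actual inductor pass):
```python
import torch.fx as fx

def promote_dequant(graph: fx.Graph, dequant_node: fx.Node) -> None:
    # Give every user beyond the first its own copy of the shared dequant node
    # so that each dequant->conv pair can be pattern-matched independently.
    users = list(dequant_node.users)
    for user in users[1:]:
        with graph.inserting_before(user):
            clone = graph.node_copy(dequant_node)
        user.replace_input_with(dequant_node, clone)
```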
**Test Plan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k test_dequant_promotion
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104590
Approved by: https://github.com/jgong5, https://github.com/eellison
ghstack dependencies: #104580, #104581, #104588
Instead of hardcoding a new callback creation using 'convert_frame',
add an attribute to both callbacks that implements 'self cloning with a new
backend', so DDPOptimizer can invoke this in a consistent way.
Fixes #107686
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107834
Approved by: https://github.com/ezyang
Summary:
When the `cat` inputs' sizes and the `split_sizes` of the downstream `split_with_sizes` match, the `cat` + `split_with_sizes` combination can be eliminated. E.g. here:
```
@torch.compile
def fn(a, b, c):
    cat = torch.ops.aten.cat.default([a, b, c], 1)
    split_with_sizes = torch.ops.aten.split_with_sizes.default(cat, [2, 3, 5], 1)
    return [s ** 2 for s in split_with_sizes]

inputs = [
    torch.randn(2, 2, device="cuda"),
    torch.randn(2, 3, device="cuda"),
    torch.randn(2, 5, device="cuda"),
]
output = fn(*inputs)
```
This PR adds a new fx pass for such elimination. The new pass is similar to the existing [`splitwithsizes_cat_replace`](b18e1b684a/torch/_inductor/fx_passes/post_grad.py (L508)), but considers the ops in the opposite order.
Test Plan:
```
$ python test/inductor/test_pattern_matcher.py
...
----------------------------------------------------------------------
Ran 21 tests in 46.450s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107956
Approved by: https://github.com/jansel
**Summary**
Update onednn from v2.7.3 to v3.1.1.
It is bc-breaking as some APIs are changed on oneDNN side. Changes include:
- PyTorch code where oneDNN is directly called
- Submodule `third_party/ideep` to adapt to oneDNN's new API.
- CMAKE files to fix build issues.
**Test plan**
Building issues and correctness are covered by CI checks.
For performance, we have run TorchBench models to ensure there is no regression. Below is the comparison before and after oneDNN update.

Note:
- Base commit of PyTorch: da322ea
- CPU: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (Ice Lake)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97957
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
Update to ROCm triton pinned commit for the 2.1 branch cut off.
As part of this we are updating `build_triton_wheel.py` and `build-triton-wheel.yml` to support building ROCm triton wheels through pytorch/manylinux-rocm to avoid the need of slowly downloading rpm libraries for ROCm in the cpu manylinux builder image and avoiding the need to maintain a conditional file with hard coded repositories from radeon.org for every ROCm release.
This new approach will allow us to build wheels faster in a more easily maintainable way.
This PR also brings in a required change as Triton on ROCm requires device_type to be set to hip so we can pass down the correct device type to triton (https://github.com/ROCmSoftwarePlatform/triton/pull/284).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107600
Approved by: https://github.com/jansel, https://github.com/jithunnair-amd
Summary:
This is a duplicate of PR 102133, which was reverted because it was
failing internal tests.
It seems that internal builds did not like my guard to check whether
cuSPARSELt was available or not.
Test Plan: python test/test_sparse_semi_structured.py
Differential Revision: D48440330
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107398
Approved by: https://github.com/cpuhrsch
Previously when we found some input or output mismatch between original args / traced result vs. graph-captured input / output, we would have a pretty sparse error message. (This might be partly due to the urge to reuse the same code for matching both inputs and outputs.)
With this PR we now point out which input or output is problematic, what its type is, and also present the expected types along with descriptions of what they mean. We don't suggest any fixes, but the idea is that it should be evident what went wrong looking at the error message.
Differential Revision: [D48668059](https://our.internmc.facebook.com/intern/diff/D48668059/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107907
Approved by: https://github.com/gmagogsfm
Use `view_as_real` to cast complex into a pair of floats and then it becomes just another binary operator.
Enable `polar` and `view_as_complex` consistency tests, but skip `test_output_grad_match_polar_cpu` as the `mul` operator is not yet supported.
Remove a redundant `#ifdef __OBJC__` and re-throw exceptions captured during the `createCacheBlock` block.
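As a quick CPU sanity check of the idea (the PR itself targets the MPS backend), componentwise binary ops on complex tensors agree with the same op applied to their `view_as_real` representation:
```python
import torch

a = torch.randn(4, dtype=torch.cfloat)
b = torch.randn(4, dtype=torch.cfloat)
# Addition on complex tensors is componentwise, so it matches addition
# on the (real, imag) float pairs exposed by view_as_real.
assert torch.allclose(torch.view_as_real(a + b),
                      torch.view_as_real(a) + torch.view_as_real(b))
```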
Fixes https://github.com/pytorch/pytorch/issues/78503
TODOs(in followup PRs):
- Implement backwards (requires complex mul and sgn)
- Measure the perf impact of computing the strides on the fly rather than ahead of time (unrelated to this PR)
Partially addresses https://github.com/pytorch/pytorch/issues/105665
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107324
Approved by: https://github.com/albanD
Summary:
Daohang reported this pattern in f469463749
{F1074472207}
{F1074473348}
Hence, we can fuse the tanh ops that follow the same split.
Typically the pattern looks like split -> getitem 0,...,n -> tanh(getitem 0,...,n). Hence, we search for the parent node of the tanh nodes, which should be getitem(parent, index). If the tanh ops follow the same split node, the parent nodes of their getitem nodes should be the same.
Test Plan:
```
[jackiexu0313@devgpu005.cln5 ~/fbsource/fbcode (c78736187)]$ buck test mode/dev-nosan //caffe2/test/inductor:group_batch_fusion
File changed: fbcode//caffe2/test/inductor/test_group_batch_fusion.py
Buck UI: https://www.internalfb.com/buck2/df87affc-d294-4663-a50d-ebb71b98070d
Test UI: https://www.internalfb.com/intern/testinfra/testrun/9570149208311124
Network: Up: 0B Down: 0B
Jobs completed: 16. Time elapsed: 1:19.9s.
Tests finished: Pass 6. Fail 0. Fatal 0. Skip 0. Build failure 0
```
Differential Revision: D48581140
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107881
Approved by: https://github.com/yanboliang
> capture_error_mode (str, optional): specifies the cudaStreamCaptureMode for the graph capture stream.
> Can be "global", "thread_local" or "relaxed". During cuda graph capture, some actions, such as cudaMalloc,
> may be unsafe. "global" will error on actions in other threads, "thread_local" will only error for
> actions in the current thread, and "relaxed" will not error on these actions.
Inductor codegen is single-threaded, so it should be safe to enable "thread_local" for inductor's cuda graph capturing. We have seen errors when inductor cudagraphs has been used concurrently with data preprocessing in other threads.
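A minimal usage sketch of the capture mode (requires a CUDA device; the keyword name is taken from the docstring quoted above, so treat the exact signature as an assumption):
```python
import torch

if torch.cuda.is_available():
    x = torch.zeros(4, device="cuda")
    s = torch.cuda.Stream()
    with torch.cuda.stream(s):   # warm up outside of capture
        y = x + 1
    torch.cuda.current_stream().wait_stream(s)

    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g, capture_error_mode="thread_local"):
        y = x + 1
    g.replay()
```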
Differential Revision: [D48656014](https://our.internmc.facebook.com/intern/diff/D48656014)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107407
Approved by: https://github.com/albanD, https://github.com/eqy
The c10d socket and the gloo listener both set their buffer size to 2048, which causes connection issues at 4k scale. This diff sets the buffer size to `-1`, which uses `somaxconn` as the actual buffer size, aiming to enable 24k PG init without crashing. The experiment shows successful creation of 12k ranks without a crash.
The original diff was split for OSS vs. internal.
Caution: we need the change on both gloo and c10d to enable 12k PG init. Updating only one side may not offer the benefit.
Differential Revision: [D48634654](https://our.internmc.facebook.com/intern/diff/D48634654/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107878
Approved by: https://github.com/H-Huang, https://github.com/fduwjj
Summary:
This adds a stride-based attribute for a tensor, available in Python.
It can help inspect tensors generated using `torch.empty_permuted(.., physical_layout, ...)`, where physical_layout should match the dim_order returned here. `empty_permuted` will be renamed to use dim_order as the param name in the future. It also helps the Executorch export pipeline with implementing dim_order-based tensors.
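A quick illustration (assuming the attribute is exposed as `Tensor.dim_order()`):
```python
import torch

# A tensor created with an NHWC-like physical layout should report that order.
t = torch.empty_permuted((2, 3, 4, 5), (0, 2, 3, 1))
print(t.dim_order())   # expected: (0, 2, 3, 1)
```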
Differential Revision: D48134476
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106835
Approved by: https://github.com/ezyang
There have been several reports of difficulty using OpenMP on MacOS, e.g.: https://github.com/pytorch/pytorch/issues/95708 . And there are several PRs to fix it, e.g.: https://github.com/pytorch/pytorch/pull/93895 and https://github.com/pytorch/pytorch/pull/105136 .
This PR tries to explain the root cause, and provides a holistic and systematic way to fix the problem.
For the OpenMP program below to run, the compiler must:
- Be able to process macros like `#pragma omp parallel`
- Be able to find header files like `<omp.h>`
- Be able to link to a library file like `libomp`
```C++
#include <omp.h>

int main()
{
    omp_set_num_threads(4);
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        int nthrds = omp_get_num_threads();
        int y = id * nthrds;
    }
}
```
In MacOS, there might be different compiler tools:
- Apple builtin `clang++`, installed with `xcode commandline tools`. The default `g++` and `clang++` commands both point to the Apple version, as can be confirmed by `g++ --version`
- Public `clang++`, can be installed via `brew install llvm`.
- Public GNU compiler `g++`, can be installed via `brew install gcc`.
Among these compilers, public `clang++` from LLVM and `g++` from GNU both support OpenMP with the flag `-fopenmp`. They have shipped with `<omp.h>` and `libomp` support. The only problem is that Apple builtin `clang++` does not contain `<omp.h>` or `libomp`. Therefore, users can follow the steps to enable OpenMP support:
- Use a compiler other than Apple builtin clang++ by specifying the `CXX` environment variable
- Use `conda install llvm-openmp` to place the header files and lib files inside conda environments (and can be discovered by `CONDA_PREFIX`)
- Use `brew install libomp` to place the header files and lib files inside brew control (and can be discovered by `brew --prefix libomp`)
- Use a custom install of OpenMP by specifying an `OMP_PREFIX` where header files and lib files can be found.
This PR reflects the above logic, and might serve as a final solution for resolving OpenMP issues in MacOS.
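A rough sketch of that discovery order (illustrative only, not the actual inductor build logic):
```python
import os
import subprocess

def find_openmp_prefix():
    # Prefer an explicit OMP_PREFIX, then a conda llvm-openmp install,
    # then fall back to a Homebrew libomp install.
    for prefix in (os.environ.get("OMP_PREFIX"), os.environ.get("CONDA_PREFIX")):
        if prefix and os.path.exists(os.path.join(prefix, "include", "omp.h")):
            return prefix
    try:
        return subprocess.check_output(["brew", "--prefix", "libomp"], text=True).strip()
    except (OSError, subprocess.CalledProcessError):
        return None
```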
This PR also resolves the discussion raised in https://dev-discuss.pytorch.org/t/can-we-add-a-default-backend-when-openmp-is-not-available/1382/5 with @jansel, provides a way for brew users to automatically find the installation via `brew --prefix libomp`, and provides instructions to switch to another compiler via the `CXX` environment variable.
I have tested the following code in different conditions:
- Use `CXX` to point to an LLVM-clang++, works fine.
- Use `CXX` to point to a GNU g++, not working because of the compiler flag `-Xclang`. Manually removing the code `base_flags += " -Xclang"` makes it work.
- Use default compiler and `conda install llvm-openmp`, works fine
- Use default compiler and `brew install libomp`, works fine
- Do nothing, compiler complains `omp.h` not found.
```python
import torch

@torch.compile
def f(x):
    return x + 1

f(torch.randn(5, 5))
```
If we want the code to be more portable, we can also deal with the `-Xclang` issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107111
Approved by: https://github.com/jgong5, https://github.com/jansel
Adds `SingletonSymNodeImpl` (alternatively, `SkolemSymNodeImpl`). This is an int-like object that only allows the `eq` operation; any other operation produces an error.
The main complexity is that we require operations that dispatch to SymNode must take and return SymNodes, but when performing operations involving `SingletonSymNodeImpl`, operations involving SymNode can return non-SymNode bools. For more discussion see [here](https://docs.google.com/document/d/18iqMdnHlUnvoTz4BveBbyWFi_tCRmFoqMFdBHKmCm_k/edit)
- Introduce `ConstantSymNodeImpl` a generalization of `LargeNegativeIntSymNodeImpl` and replace usage of `LargeNegativeIntSymNodeImpl` in SymInt.
- Also use ConstantSymNodeImpl to enable SymBool to store its data on a SymNode. Remove the assumption that if SymBool holds a non-null SymNode, it must be symbolic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107089
Approved by: https://github.com/ezyang
ghstack dependencies: #107839
Although there are some performance benefits to enforcing NHWC convolutions as inductor's fallback for all hardware, this may not always be the case. Currently on ROCm we are seeing some slowdowns on gcnArchs that do not have optimal NHWC implementations, and we would like to introduce some control over this behavior in PyTorch. On ROCm MI200-series GPUs we will default to the enforced channels-last behavior, aligned with the rest of PyTorch, but on non-MI200-series GPUs we will disable the forced layout.
For now we are using torch.cuda.get_device_name(0) for this control, but we will replace it with gcnArchName when https://github.com/pytorch/pytorch/pull/107477 lands.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107812
Approved by: https://github.com/jataylo, https://github.com/eellison
Add environment name for S3 HTMLs workflow. This allows secure and controlled access to the secrets and approval for updating the PyTorch whl indexes on S3.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107889
Approved by: https://github.com/huydhn
The way the aot autograd sequence_nr tracking works is that we run the aot export logic, and the dynamo-captured forward graph is run under an fx.Interpreter, which iterates through the nodes of the forward graph while setting `current_metadata`.
Since what is run during backward doesn't correspond to any node from the forward, we fall back to the global `current_metadata`. And since this global metadata ends up being shared between runs, that leads to weirdness if we forget to reset things, e.g., depending on whether this is the first test run, the printed results will differ.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107210
Approved by: https://github.com/bdhirsh
### Description
The `download_url_to_file` function in torch.hub uses a temporary file to prevent overriding a local working checkpoint with a broken download. This temporary file is created using `NamedTemporaryFile`. However, since `NamedTemporaryFile` creates files with overly restrictive permissions (0600), the resulting download will not have default permissions and will not respect the umask on Linux (since moving the file retains the restrictive permissions of the temporary file). This is especially problematic when trying to share model checkpoints between multiple users, as other users will not even have read access to the file.
The change in this PR fixes the issue by using custom code to create the temporary file without changing its permissions to 0600 (unfortunately there is no way to override the permission behaviour of the existing Python standard library code). This ensures that the downloaded checkpoint file correctly has the default permissions applied. If a user wants more restrictive permissions, they can apply them via the usual means (i.e. by setting the umask).
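A minimal sketch of the approach (not the actual torch.hub code; `open_partial_download` is a hypothetical helper):
```python
import uuid

def open_partial_download(dst: str):
    # Unlike NamedTemporaryFile, open() honours the umask, so the final file
    # (after os.replace(tmp_path, dst)) keeps default permissions.
    tmp_path = f"{dst}.{uuid.uuid4().hex}.partial"
    return open(tmp_path, "wb"), tmp_path
```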
See these similar issues in other projects for even more context:
* https://github.com/borgbackup/borg/issues/6400
* https://github.com/borgbackup/borg/issues/6933
* https://github.com/zarr-developers/zarr-python/issues/325
### Issue
https://github.com/pytorch/pytorch/issues/81297
### Testing
Extended the unit test `test_download_url_to_file` to also check permissions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82869
Approved by: https://github.com/vmoens
Summary:
(From Brian Hirsh)
Description copied from what I put in a comment in this PR: https://github.com/pytorch/pytorch/pull/106329
So, the slightly-contentious idea behind this PR is that lower in the stack, I updated torch._decomps.get_decomps() to check not only the decomp table to see if a given op has a decomposition available, but to also check the dispatcher for any decomps registered to the CompositeImplicitAutograd key (link: https://github.com/pytorch/pytorch/pull/105865/files#diff-7008e894af47c01ee6b8eb94996363bd6c5a43a061a2c13a472a2f8a9242ad43R190)
There's one problem though: we don't actually make any hard guarantees that a given key in the dispatcher does or does not point to a decomposition. We do rely pretty heavily, however, on the fact that everything registered to the CompositeImplicitAutograd key is in fact a decomposition into other ops.
QAT would like this API to faithfully return "the set of all decomps that would have run if we had traced through the dispatcher". However, native_batch_norm is an example of an op that has a pre-autograd decomp registered to it (through op.py_impl()), but the decomp is registered directly to the Autograd key instead of being registered to the CompositeImplicitAutograd key.
If we want to provide a guarantee to QAT that they can programmatically access all decomps that would have run during tracing, then we need to make sure that every decomp we register to the Autograd key is also registered to the CompositeImplicitAutograd key.
This might sound kind of painful (since it requires auditing), but I think in practice this basically only applies to native_batch_norm.
Test Plan: python test/test_decomp.py
Differential Revision: D48607575
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107791
Approved by: https://github.com/jerryzh168, https://github.com/SherlockNoMad
Fixes #92000
The documentation at https://pytorch.org/docs/stable/generated/torch.nn.MultiLabelSoftMarginLoss.html#multilabelsoftmarginloss states:
> label targets padded by -1 ensuring same shape as the input.
However, the shape of input and target tensor are compared, and an exception is raised if they differ in either dimension 0 or 1. Meaning the label targets are never padded. See the code snippet below and the resulting output. The documentation is therefore adjusted to:
> label targets must have the same shape as the input.
```
import torch
import torch.nn as nn

# Create some example data
input = torch.tensor(
    [
        [0.8, 0.2, -0.5],
        [0.1, 0.9, 0.3],
    ]
)
target1 = torch.tensor(
    [
        [1, 0, 1],
        [0, 1, 1],
        [0, 1, 1],
    ]
)
target2 = torch.tensor(
    [
        [1, 0],
        [0, 1],
    ]
)
target3 = torch.tensor(
    [
        [1, 0, 1],
        [0, 1, 1],
    ]
)

loss_func = nn.MultiLabelSoftMarginLoss()

try:
    loss = loss_func(input, target1).item()
except RuntimeError as e:
    print('target1 ', e)

try:
    loss = loss_func(input, target2).item()
except RuntimeError as e:
    print('target2 ', e)

loss = loss_func(input, target3).item()
print('target3 ', loss)
```
output:
```
target1 The size of tensor a (3) must match the size of tensor b (2) at non-singleton dimension 0
target2 The size of tensor a (2) must match the size of tensor b (3) at non-singleton dimension 1
target3 0.6305370926856995
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107817
Approved by: https://github.com/mikaylagawarecki
Summary:
This relands #107601, which was reverted due to the new test failing in the internal CI. Here we skip the new test (as well as the existing tests in `test_aot_inductor.py`, as those are also failing in the internal CI).
Test Plan:
```
$ python test/inductor/test_aot_inductor.py
...
----------------------------------------------------------------------
Ran 5 tests in 87.309s
OK
```
Differential Revision: [D48623171](https://our.internmc.facebook.com/intern/diff/D48623171)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107814
Approved by: https://github.com/eellison
Cap the opset version at 17 for torch.onnx.export and suggest that users use the dynamo exporter. Warn users instead of failing hard because we should still allow users to create custom symbolic functions for opset > 17.
Also updates the default opset version by running `tools/onnx/update_default_opset_version.py`.
Fixes #107801
Fixes #107446
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107829
Approved by: https://github.com/BowenBao
Currently there are 4 cases where constraint violation errors are raised, but the error messages are (a) inconsistent in their information content and (b) worded in ways that are difficult for the end user to understand.
This diff cuts one of the cases that can never be reached, and makes the other 3
(a) consistent, e.g. they all point out that some values in the given range may not work, citing a reason and asking the user to run with logs to follow up
(b) user-friendly, e.g., compiler-internal info is cut out or replaced with user-facing syntax.
Differential Revision: D48576608
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107790
Approved by: https://github.com/tugsbayasgalan, https://github.com/angelayi
Summary: Previously, serializing graphs that use map would error
because map returns a singleton tensor list rather than a
single tensor. So this diff adds support for higher order operators
that return a list of tensors as output.
We also run into an issue with roundtripping the source_fn on
map nodes/subgraphs. The source_fn originally is
<functorch.experimental._map.MapWrapper object at 0x7f80a0549930>, which
serializes to `functorch.experimental._map.map`. However, we are unable
to construct the function from this string. This should be fixed once
map becomes a fully supported operator like
torch.ops.higher_order.cond.
Differential Revision: [D48631302](https://our.internmc.facebook.com/intern/diff/D48631302)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107837
Approved by: https://github.com/zhxchen17
ghstack dependencies: #107818
Some Nvidia TRT folks were asking for a way to integrate the serialization of custom objects with export's serialization. After some discussion (more background [here](https://docs.google.com/document/d/1lJfxakmgeoEt50inWZ53MdUtOSa_0ihwCuPy_Ak--wc/edit)), we settled on a way for users to register their custom object's serializer/deserializer functions.
Since TorchScript's `.def_pickle` already exists for [registering custom classes](https://pytorch.org/tutorials/advanced/torch_script_custom_classes.html), and `tensorrt.ICudaEngine` already contains a `.def_pickle` implementation, we'll start off by reusing the existing framework and integrating it with export's serialization.
TorchScript's `.def_pickle` requires users to register two functions, which end up being the `__getstate__` and `__setstate__` methods on the class. The semantics of `__getstate__` and `__setstate__` in TorchScript are equivalent to that of Python pickle modules. This is then registered using pybind's `py::pickle` function [here](https://www.internalfb.com/code/fbsource/[f44e048145e4697bccfaec300798fce7daefb858]/fbcode/caffe2/torch/csrc/jit/python/script_init.cpp?lines=861-916) to be used with Python's pickle to initialize a ScriptObject with the original class, and set the state back to what it used to be.
I attempted to call `__getstate__` and `__setstate__` directly, but I couldn't figure out how to initialize the object on which `__setstate__` would be called in Python. One option would be to create a `torch._C.ScriptObject` and then set the class and call `__setstate__`, but there is no constructor exposed for ScriptObjects. Another option would be to construct an instance of the serialized class itself, but if the class constructor required arguments, I wouldn't know what to initialize it with. ScriptObject's `py::pickle` registration directly creates the object [here](https://www.internalfb.com/code/fbsource/[f44e048145e4697bccfaec300798fce7daefb858]/fbcode/caffe2/torch/csrc/jit/python/script_init.cpp?lines=892-906), which is why I was thinking that just directly using Python's `pickle` will be ok since it is handled there.
So, what I did is that I check if the object is pickle-able, meaning it contains `__getstate__` and `__setstate__` methods, and if so, I serialize it with Python's pickle. TorchScript does have its own implementation of [pickle/unpickle](https://www.internalfb.com/code/fbsource/[59cbc569ccbcaae0db9ae100c96cf0bae701be9a][history]/fbcode/caffe2/torch/csrc/jit/serialization/pickle.h?lines=19%2C82), but it doesn't seem to have pybinded functions callable from python.
A question is -- is it ok to combine this pickle + json serialization?
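A minimal sketch of the check-and-pickle approach described above (illustrative only; `serialize_script_object` is a hypothetical name, not the real serializer hook):
```python
import pickle

def serialize_script_object(obj) -> bytes:
    # Classes registered via .def_pickle expose __getstate__/__setstate__,
    # so plain pickle can round-trip them.
    if hasattr(obj, "__getstate__") and hasattr(obj, "__setstate__"):
        return pickle.dumps(obj)
    raise RuntimeError(
        f"{type(obj)} must register __getstate__/__setstate__ via def_pickle"
    )
```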
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107666
Approved by: https://github.com/gmagogsfm
Summary:
After we compile the dense arch, we observe a split-linear-cat pattern. Hence, we want to use the bmm fusion + split cat pass to fuse the pattern into torch.baddbmm.
Some explanation of why we prefer pre grad:
1) We need to add the bmm fusion before the split cat pass (which is a pre grad pass) so that the newly added stack and unbind nodes are removed together with the original cat/split nodes.
2) Post grad does not support torch.stack/unbind. There is a hacky workaround, but it may not land in a short time.
Test Plan:
# unit test
```
buck test mode/dev-nosan //caffe2/test/inductor:group_batch_fusion
[jackiexu0313@devgpu005.cln5 ~/fbsource/fbcode (f0ff3e3fc)]$ buck test mode/dev-nosan //caffe2/test/inductor:group_batch_fusion
File changed: fbcode//caffe2/test/inductor/test_group_batch_fusion.py
Buck UI: https://www.internalfb.com/buck2/189dd467-d04d-43e5-b52d-d3b8691289de
Test UI: https://www.internalfb.com/intern/testinfra/testrun/5910974704097734
Network: Up: 0B Down: 0B
Jobs completed: 14. Time elapsed: 1:05.4s.
Tests finished: Pass 5. Fail 0. Fatal 0. Skip 0. Build failure 0
```
# local test
```
=================Single run start========================
enable split_cat_pass for control group
================latency analysis============================
latency is : 73.79508209228516 ms
=================Single run start========================
enable batch fusion for control group
enable split_cat_pass for control group
================latency analysis============================
latency is : 67.94447326660156 ms
```
# e2e test
todo add e2e test
Differential Revision: D48539721
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107759
Approved by: https://github.com/yanboliang
Add a new fused_attention pattern matcher to Inductor, in order to make more models call the SDPA op.
The following models would call SDPA due to the added pattern:
For HuggingFace
- AlbertForMaskedLM
- AlbertForQuestionAnswering
- BertForMaskedLM
- BertForQuestionAnswering
- CamemBert
- ElectraForCausalLM
- ElectraForQuestionAnswering
- LayoutLMForMaskedLM
- LayoutLMForSequenceClassification
- MegatronBertForCausalLM
- MegatronBertForQuestionAnswering
- MobileBertForMaskedLM
- MobileBertForQuestionAnswering
- RobertaForCausalLM
- RobertaForQuestionAnswering
- YituTechConvBert
For TorchBench
- llama
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107578
Approved by: https://github.com/mingfeima, https://github.com/XiaobingSuper, https://github.com/jgong5, https://github.com/eellison, https://github.com/jansel
As the title says, I was trying to test the functional collectives, and, when printing the resulting tensors, sometimes they wouldn't have finished the Async operation yet. According to the comments in the file, "AsyncTensor wrapper applied to returned tensor, which issues wait_tensor() at the time of first use". This is true in most cases, but not when print() is your first use. This PR fixes that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107808
Approved by: https://github.com/fduwjj
_enable_dynamo_cache_lookup_profiler used to get turned on when running `__enter__` or `__exit__` with the profiler. But it's possible to turn the profiler on and off without the context manager (e.g. with a schedule and calling `.step()`). Instead, we should put these calls (which are supposed to be executed when the profiler turns on/off) where `_enable_profiler()` and `_disable_profiler()` are called.
This puts `_enable_dynamo_cache_lookup_profiler` and `_set_is_profiler_enabled` into `_run_on_profiler_(start|stop)` and calls that on the 3 places where `_(enable|disable)_profiler` get called.
Differential Revision: [D48619818](https://our.internmc.facebook.com/intern/diff/D48619818)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107720
Approved by: https://github.com/wconstab
Moves the logic to casting state to match parameters into a hook so that users can choose to enable their hooks before or after the casting has happened.
With this, there is a little bit of redundancy of the id_map building and the check that the param groups are still aligned in length.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106725
Approved by: https://github.com/albanD
This PR looks big, but it's mostly just refactorings with a bit of dead code deletion. Exceptions are:
- Some metric emissions were changed to comply with the new TD format
- Some logging changes
- We now run tests in three batches (highly_relevant, probably_relevant, unranked_relevance) instead of the previous two (prioritized and general)
Refactorings done:
- Moves all test reordering code to the new TD framework
- Refactors run_test.py to cleanly support multiple levels of test priorities
- Deletes some dead code that was originally written for logging
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107071
Approved by: https://github.com/clee2000, https://github.com/huydhn
I noticed a curious case in https://github.com/pytorch/pytorch/pull/107508 where there was one broken trunk failure and the PR was merged with `merge -ic`. Because the failure had been classified as unrelated, I expected to see a no-op force merge here. However, it showed up as a force merge with failure.

The record on Rockset reveals https://github.com/pytorch/pytorch/pull/107508 has:
* 0 broken trunk checks (unexpected, this should be 1 as Dr. CI clearly says so)
* 1 ignore current check (unexpected, this should be 0 and the failure should be counted as broken trunk instead)
* 3 unstable ROCm jobs (expected)
It turns out that ignore current takes precedence over the flaky and broken trunk classifications. This might have been the expectation in the past, but I think that's not the case now. The bot should be consistent with what is shown on Dr. CI. The change here is to make the flaky, unstable, and broken trunk classifications take precedence over ignore current. Basically, we only need to ignore new or unrecognized failures that have not yet been classified.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107761
Approved by: https://github.com/clee2000
When exporting dropout with a CPU tensor, we get the following graph module
```
class GraphModule(torch.nn.Module):
    def forward(self, arg0_1: f32[512, 10]):
        empty_memory_format: f32[512, 10] = torch.ops.aten.empty.memory_format([512, 10], dtype = torch.float32, layout = torch.strided, device = device(type='cpu'), pin_memory = False, memory_format = torch.contiguous_format)
        bernoulli_p: f32[512, 10] = torch.ops.aten.bernoulli.p(empty_memory_format, 0.9); empty_memory_format = None
        div_scalar: f32[512, 10] = torch.ops.aten.div.Scalar(bernoulli_p, 0.9); bernoulli_p = None
        mul_tensor: f32[512, 10] = torch.ops.aten.mul.Tensor(arg0_1, div_scalar); arg0_1 = div_scalar = None
        return (mul_tensor,)
```
In addition, if we export with eval() mode, we will have an empty graph.
However, when exporting with a CUDA tensor, we get
```
class GraphModule(torch.nn.Module):
    def forward(self, arg0_1: f32[512, 10]):
        native_dropout_default = torch.ops.aten.native_dropout.default(arg0_1, 0.1, True); arg0_1 = None
        getitem: f32[512, 10] = native_dropout_default[0]; native_dropout_default = None
        return (getitem,)
```
and exporting under eval() mode will still have a dropout node in the graph.
This PR makes exporting with a CPU tensor also produce aten.native_dropout.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106274
Approved by: https://github.com/ezyang
This change is to match the behavior of _record_memory_history which was
recently changed to enable history recording on all devices rather than
the current one. It prevents confusing situations where the observer
was registered before the device was set for the training run.
It also ensures the allocators have been initialized in the python binding just in case this is the first call to the CUDA API.
Fixes #107330
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107399
Approved by: https://github.com/eellison
ghstack dependencies: #107171
**Overview**
This PR runs the HSDP all-reduce as async so that it can overlap with both all-gather and reduce-scatter, which can lead to slight end-to-end speedups when the sharding process group is fully intra-node. Previously, the all-reduce serializes with reduce-scatter, so it can only overlap with one all-gather.
For some clusters (e.g. our AWS cluster), `NCCL_CROSS_NIC=1` improves inter-node all-reduce times when overlapped with intra-node all-gather/reduce-scatter.
**Experiment**
<details>
<summary> Example 'before' trace </summary>
<img width="559" alt="hsdp_32gpus_old" src="https://github.com/pytorch/pytorch/assets/31054793/15222b6f-2b64-4e0b-a212-597335f05ba5">
</details>
<details>
<summary> Example 'after' trace </summary>
<img width="524" alt="hsdp_32gpus_new" src="https://github.com/pytorch/pytorch/assets/31054793/94f63a1d-4255-4035-9e6e-9e10733f4e44">
</details>
For the 6-encoder-layer, 6-decoder layer transformer with `d_model=8192`, `nhead=64` on 4 nodes / 32 40 GB A100s via AWS, the end-to-end iteration times are as follows (with AG == all-gather, RS == reduce-scatter, AR == all-reduce; bandwidth reported as algorithmic bandwidth):
- Reference FSDP:
- **1160 ms / iteration**
- ~23 ms / encoder AG/RS --> 24.46 GB/s bandwidth
- ~40 ms / decoder AG/RS --> 26.5 GB/s bandwidth
- 50 GB/s theoretical inter-node bandwidth
- Baseline 8-way HSDP (only overlap AR with AG) -- intra-node AG/RS, inter-node AR:
- **665 ms / iteration**
- ~3 ms / encoder AG/RS --> 187.5 GB/s bandwidth
- ~5 ms / decoder AG/RS --> 212 GB/s bandwidth
- ~30 ms / encoder AR --> 2.34 GB/s bandwidth
- ~55 ms / decoder AR --> 2.65 GB/s bandwidth
- 300 GB/s theoretical intra-node bandwidth
- New 8-way HSDP (overlap AR with AG and RS) -- intra-node AG/RS, inter-node AR:
- **597 ms / iteration**
- ~3 ms / encoder AG/RS --> 187.5 GB/s bandwidth
- ~6.2 ms / decoder AG/RS --> 170.97 GB/s bandwidth (slower)
- ~23 ms / encoder AR (non-overlapped) --> 3.057 GB/s bandwidth (faster)
- ~49 ms / decoder AR (non-overlapped) --> 2.70 GB/s bandwidth (faster)
- ~100 ms / decoder AR (overlapped) --> 1.325 GB/s bandwidth (slower)
- Overlapping with reduce-scatter reduces all-reduce bandwidth utilization even though the all-reduce is inter-node and reduce-scatter is intra-node!
- New 8-way HSDP (overlap AR with AG and RS) with `NCCL_CROSS_NIC=1`:
- **556 ms / iteration**
- Speedup comes from faster overlapped AR
Thus, for this particular workload, the async all-reduce enables 16% iteration-time speedup compared to the existing HSDP and 52% speedup compared to FSDP. These speedups are pronounced due to the workload being communication bound, so any communication time reduction translates directly to speedup.
**Unit Test**
This requires >= 4 GPUs:
```
python -m pytest test/distributed/fsdp/test_fsdp_hybrid_shard.py -k test_fsdp_hybrid_shard_parity
```
Differential Revision: [D47852456](https://our.internmc.facebook.com/intern/diff/D47852456)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106080
Approved by: https://github.com/ezyang
ghstack dependencies: #106068
The post-backward hook has some complexity due to the different paths: {no communication hook, communication hook} x {`NO_SHARD`, `FULL_SHARD`/`SHARD_GRAD_OP`, `HYBRID_SHARD`/`_HYBRID_SHARD_ZERO2`} plus some options like CPU offloading and `use_orig_params=True` (requiring using sharded gradient views).
The PR following this one that adds async all-reduce for HSDP further complicates this since the bottom-half after all-reduce must still be run in the separate all-reduce stream, making it more unwieldy to unify with the existing bottom-half.
Nonetheless, this PR breaks up the post-backward hook into smaller logical functions to hopefully help readability.
Differential Revision: [D47852461](https://our.internmc.facebook.com/intern/diff/D47852461)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106068
Approved by: https://github.com/ezyang, https://github.com/fegin
Previously, the top level GraphProto was hardcoded with the name "torch_jit", and the subgraphs "torch_jit_{count}". This does not offer any insight into the graph, but rather encodes the graph producer as jit (torchscript). This is no longer true now that the graph can also be produced from dynamo.
As a naive first step, this PR re-purposes the names to "main_graph" and "sub_graph_{count}" respectively. More delicate processing could name the subgraphs with respect to their parent node or module; this can be done as a follow-up.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107408
Approved by: https://github.com/justinchuby, https://github.com/titaiwangms
Summary:
D48295371 caused a batch fusion failure, which will block mc proposals on all mc models,
e.g. cmf f470938179.
Test Plan: Without the revert, f469732293. With the revert diff, f472266199.
Differential Revision: D48610062
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107796
Approved by: https://github.com/yanboliang
The `broadcast_object_list` function can easily broadcast the state_dict of models/optimizers. However, the `torch.cat` operation performed within `broadcast_object_list` consumes an additional, double amount of memory, which means only objects that occupy at most half of the device memory can be broadcast. This PR improves usability by skipping the `torch.cat` operation on object lists with only a single element.
Before (30G tensor):
<img width="607" alt="image" src="https://github.com/pytorch/pytorch/assets/22362311/c0c67931-0851-4f27-81c1-0119c6cd2944">
After (46G tensor):
<img width="600" alt="image" src="https://github.com/pytorch/pytorch/assets/22362311/90cd1536-be7c-43f4-82ef-257234afcfa5">
Test Code:
```python
import torch
import torch.distributed as dist

if __name__ == "__main__":
    dist.init_process_group(backend='nccl')
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    fake_tensor = torch.randn(30 * 1024 * 1024 * 1024 // 4)
    if dist.get_rank() == 0:
        state_dict = {"fake_tensor": fake_tensor}
    else:
        state_dict = {}

    object_list = [state_dict]
    dist.broadcast_object_list(object_list, src=0)
    print("Rank: ", dist.get_rank(), " Broadcasted Object: ", object_list[0].keys())
    dist.barrier()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107509
Approved by: https://github.com/awgu
These lowerings must copy even when they are no-ops in order to preserve
correctness in the presence of mutations. However, `to_dtype` and `to_device`
are also used in various lowerings as a helper function where it is okay to alias.
So, I've split these into two functions and allow the helper functions to alias
which saves some unnecessary copies.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107640
Approved by: https://github.com/lezcano
In this PR, we make ExportedProgram a valid callable for export, enabling re-exporting. Note that we don't allow any new constraints to be specified by the user, as we don't have any way of handling them right now. There are some caveats worth mentioning in this PR.
Today, graph_module.meta is not preserved (note that this is different from node-level meta, which we do preserve). Our export logic relies on this meta to process the constraints, but if we skip dynamo, we have to preserve the constraints stored in graph_module.meta ourselves. Once dynamo supports retraceability, we won't have to do this anymore. I currently manually save graph_module.meta at the following places:
1. After ExportedProgram.module()
2. After ExportedProgram.transform()
3. At construction site of ExportedProgram.
Jerry will add the update on the quantization side as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107657
Approved by: https://github.com/gmagogsfm
1. Add a list of HF models to CI tests. The PR intends to build them from Config, but some of them are not supported with Config. NOTE: loading from a pre-trained model could potentially hit the [uint8/bool conflict](https://github.com/huggingface/transformers/issues/21013) when a newer version of transformers is used.
- Dolly has a torch.fx.Node in an OnnxFunction attribute, which is currently not supported.
- Falcon and MPT have user code that is unsupported by Dynamo.
2. Only update GPT2 exporting with real tensors to Config, as FakeMode raises unequal-input errors between PyTorch and ORT. The reason is that [non-persistent buffers are not supported](https://github.com/pytorch/pytorch/issues/107211)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107247
Approved by: https://github.com/wschin, https://github.com/BowenBao
Prior to this PR, `_assert_fake_tensor_mode` checked that all exporting tracers enable fake mode "from" the exporter API whenever they have fake tensors in args/buffers/weights. However, FXSymbolicTracer doesn't use the exporter API to create fake mode, so it hits the `raise RuntimeError` every time we run it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107712
Approved by: https://github.com/BowenBao
**Motivation:**
When input FakeTensor to torch.compile has SymInt sizes (e.g. make_fx(opt_f, tracing_mode="symbolic"):
1. We cannot create a FakeTensor from that input in dynamo due to the SymInts.
2. We cannot check input tensors in the guard check function and will abort because the tensor check calls sizes/strides.
For 1, we specialize the FakeTensor's SymInts using their hints. This is mostly safe since inputs mostly have concrete shapes and are not computed from DynamicOutputShape ops. We'll throw a data-dependent error if the symint is unbacked.
For 2, we replace size/stride calls with the sym_* variants in TENSOR_CHECK guards' check function.
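The motivating scenario looks roughly like this (a sketch of the setup from the motivation, not one of the added tests):
```python
import torch
from torch.fx.experimental.proxy_tensor import make_fx

def f(x):
    return x * 2

# Tracing the compiled function under symbolic shapes feeds torch.compile
# FakeTensor inputs whose sizes are SymInts.
opt_f = torch.compile(f)
gm = make_fx(opt_f, tracing_mode="symbolic")(torch.randn(4, 8))
print(gm.graph)
```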
**Test Plan:**
See added tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107662
Approved by: https://github.com/ezyang
Summary: Added due to how common the op is. For performance reasons users may not want to decompose batch_norm op. batch_norm is also part of StableHLO
Test Plan: After adding to IR, we can enable _check_ir_validity in exir.EdgeCompileConfig for models like MV2, MV3, IC3, IC4
Reviewed By: guangy10
Differential Revision: D48576866
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107732
Approved by: https://github.com/manuelcandales, https://github.com/guangy10
This updates ruff to 0.285, which is faster, better, and fixes a bunch of false negatives with regard to f-strings.
I also enabled RUF017 which looks for accidental quadratic list summation. Luckily, seems like there are no instances of it in our codebase, so enabling it so that it stays like that. :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519
Approved by: https://github.com/ezyang
Summary: Currently serializing graphs which return get_attr's directly as output fails. This diff adds support for that only in EXIR serializer while we still support unlifted params.
Test Plan: Added test case.
Differential Revision: D48258552
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107610
Approved by: https://github.com/angelayi
This PR fixes the requires_grad handling when calling distribute_tensor: we should set the requires_grad of the local tensor after the detach call to make sure we create the leaf correctly; otherwise it would raise warnings.
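A minimal sketch of the ordering described above (not the actual distribute_tensor source):
```python
import torch

def make_local_leaf(t: torch.Tensor) -> torch.Tensor:
    local = t.detach()                      # detach first so the new tensor is a leaf
    local.requires_grad_(t.requires_grad)   # then restore requires_grad, avoiding the non-leaf warning
    return local
```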
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107606
Approved by: https://github.com/fduwjj
torch.profiler.record_function is relatively slow; for example, in some benchmarks I was running, x.view_as(x) was ~2us, and ~16-17us when wrapped in a record_function context. The reasons for this are: dispatcher overhead from going through an op (the main source of overhead), python binding / python conversion overhead, and some overhead from the context manager.
This new implementation is faster, but it won't work with torchscript. Based on the benchmarks I was running, it adds 0.5-0.7us overhead per call when the profiler is turned off. To use it, you can just:
```python
with torch._C._profiler_manual._RecordFunctionFast("title"):
    torch.add(x, y)
```
It implements a context manager in python which directly calls the record_function utilities, instead of calling through an op.
* The context manager is implemented directly in python because the overhead from calling a python function seems non-negligible
* All the record_function calls, python object conversions are guarded on checks for whether the profiler is enabled or not. It seems like this saves a few hundred nanoseconds.
For more details about the experiments I ran to choose this implementation, see [my record_functions experiments branch](https://github.com/pytorch/pytorch/compare/main...davidberard98:pytorch:record-function-fast-experiments?expand=1).
This also adds a `torch.autograd.profiler._is_profiler_enabled` global variable that can be used to check whether a profiler is currently enabled. It's useful for further reducing the overhead, like this:
```python
if torch.autograd.profiler._is_profiler_enabled:
    with torch._C._profiler_manual._RecordFunctionFast("title"):
        torch.add(x, y)
else:
    torch.add(x, y)
```
On BERT_pytorch (CPU-bound model), if we add a record_function inside CachedAutotuning.run:
* Naive torch.profiler.record_function() is a ~30% slowdown
* Always wrapping with RecordFunctionFast causes a regression of ~2-4%.
* Guarding with an if statement - any regression is within noise
**Selected benchmark results**: these come from a 2.20GHz machine, GPU build but only running CPU ops; running `x.view_as(x)`, with various record_functions applied (with profiling turned off). For more detailed results see "record_functions experiments branch" linked above (those results are on a different machine, but show the same patterns). Note that the results are somewhat noisy, assume 0.05-0.1us variations
```
Baseline:: 1.7825262546539307 us # Just running x.view_as(x)
profiled_basic:: 13.600390434265137 us # torch.profiler.record_function(x) + view_as
precompute_manual_cm_rf:: 2.317216396331787 us # torch._C._profiler_manual._RecordFunctionFast(), if the context is pre-constructed + view_as
guard_manual_cm_rf:: 1.7994389533996582 us # guard with _is_profiler_enabled + view_as
```
Differential Revision: [D48421198](https://our.internmc.facebook.com/intern/diff/D48421198)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107195
Approved by: https://github.com/albanD, https://github.com/aaronenyeshi
These jobs have write access to S3 when they are running on our self-hosted runners. On the other hand, they would need the AWS credential to run if they are run on a GitHub ephemeral runner.
### Testing
Use the AWS credential in upload-stats environment to run the test command successfully (currently failing in trunk due to the lack of permission a5f83245fd)
```
python3 tools/alerts/upload_alerts_to_aws.py --alerts '[{"AlertType": "Recurrently Failing Job", "AlertObject": "Upload Alerts to AWS/Rockset / upload-alerts", "OncallTeams": [], "OncallIndividuals": [], "Flags": [], "sha": "c8a6c74443f298111fd6568e2828765d87b69c98", "branch": "main"}, {"AlertType": "Recurrently Failing Job", "AlertObject": "inductor / cuda12.1-py3.10-gcc9-sm86 / test (inductor_torchbench, 1, 1, linux.g5.4xlarge.nvidia.gpu)", "OncallTeams": [], "OncallIndividuals": [], "Flags": [], "sha": "f13101640f548f8fa139c03dfa6711677278c391", "branch": "main"}, {"AlertType": "Recurrently Failing Job", "AlertObject": "slow / linux-focal-cuda12.1-py3.10-gcc9-sm86 / test (slow, 1, 2, linux.g5.4xlarge.nvidia.gpu)", "OncallTeams": [], "OncallIndividuals": [], "Flags": [], "sha": "6981bcbc35603e5d8ac7d00a2032925239009db5", "branch": "main"}]' --org "pytorch" --repo "pytorch"
Writing 138 documents to S3
Done!
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107717
Approved by: https://github.com/clee2000
@huydhn
Our current workflow is to upload to GH and then upload from GH to S3 when uploading test stats at the end of a workflow.
I think these keys could be used to directly upload from the runner to S3 but we don't do that right now.
I'm not sure how high priority these keys are.
Rocm artifacts can still be seen on the HUD page
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107613
Approved by: https://github.com/huydhn
Removing expected failures relating to inductor batch_norm on ROCm
Also removing the addition of `tanh` to the expected failures list, as this is a CUDA-exclusive failure already captured here (cc: @peterbell10)
```
if not TEST_WITH_ROCM:
    inductor_gradient_expected_failures_single_sample["cuda"]["tanh"] = {f16}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107027
Approved by: https://github.com/peterbell10
Sometimes test suite names include file/module names since they were imported from another file (ex: _nvfuser.test_dynamo.TestNvFuserDynamo). This can make the name autogenerated by the disable bot and the disable-test button on HUD incorrect, which is annoying to track down and leads to issues that are open but don't actually do anything. My solution is to make the check between the issue name and the test more flexible: instead of checking the entire test suite name, we chop off the file/module names, only look at the last part (ex: TestNvFuserDynamo), and check whether those are equal.
Also bundled the check against the names in the slow test json and the disable test issue names into one function, for no reason other than less code.
Looked through logs to see what tests are skipped with this vs the old one and it looked the same.
The diff looks like a big change but it's mostly a change in indentation.
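A minimal sketch of the looser match, with an illustrative function name:
```python
def suite_names_match(issue_suite: str, test_suite: str) -> bool:
    # "_nvfuser.test_dynamo.TestNvFuserDynamo" and "TestNvFuserDynamo" now compare equal
    return issue_suite.split(".")[-1] == test_suite.split(".")[-1]
```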
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104002
Approved by: https://github.com/ZainRizvi, https://github.com/huydhn
Summary: Adds new tracepoints to CUDA allocator code for tracking alloc and dealloc events in the allocator code.
Test Plan: This change simply adds static tracepoints to CUDA allocator code, and does not otherwise change any logic. Testing is not required.
Reviewed By: chaekit
Differential Revision: D48229150
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107322
Approved by: https://github.com/chaekit
Previously, the first overload of `_make_wrapper_subclass` returned a tensor that **always** advertised as having a non-resizeable storage. Eventually, we'll need it to advertise as resizeable for functionalization to work (since functionalization occasionally needs to resize storages).
Not directly tested in this PR (tested more heavily later in aot dispatch, but if someone wants me to write a more direct test I can add one).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107416
Approved by: https://github.com/ezyang, https://github.com/albanD
ghstack dependencies: #107417
This was discussed in feedback from the original version of my "reorder proxy/fake" PR. This PR allows calls to `tensor.untyped_storage()` to **always** return a python storage object to the user. Previously, we would error loudly if we detected that the storage had a null dataptr.
Instead, I updated the python bindings for the python storage methods that I saw involve data access, to throw an error later, only if you try to access those methods (e.g. `storage.data_ptr()` will now raise an error if the data ptr is null).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107417
Approved by: https://github.com/albanD, https://github.com/ezyang, https://github.com/zou3519
Summary:
This PR improves `generate_opcheck_tests`:
- We shouldn't run automated testing through operators called in
torch.jit.trace / torch.jit.script
- I improved the error message and added a guide on what to do if one of the
tests fail.
- While dogfooding this, I realized I wanted a way to reproduce the failure
without using the test suite. If you pass `PYTORCH_OPCHECK_PRINT_REPRO`, it
will now print a minimal repro on failure. This involves serializing some
tensors to disk.
- The minimal repro includes a call to a new API called `opcheck`.
The opcheck utility runs the same checks as the tests generated
by `generate_opcheck_tests`. It doesn't have a lot of knobs on it for
simplicity. The general workflow is: if an autogenerated test fails, then the
user may find it easier to reproduce the failure without the test suite by
using opcheck
Test Plan: - new tests
Differential Revision: D48485013
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107597
Approved by: https://github.com/ezyang
This PR stops `SymNode` from mutating (i.e. simplifying) its expression. Instead, the
simplification (without mutation) is deferred to the `SymNode.maybe_as_int` method.
```python
- FakeTensor(size=(s0,), ...)
- FakeTensor(size=(s1, s2, s3), ...)
- Eq(s0, s1 + s2 + s3)
- FakeTensor(size=(s0,), ...)
- FakeTensor(size=(s1, s2, s3), ...)
```
In summary, this PR:
- Replaces `SymNode._expr` by `SymNode.expr`, removing the old property function
- This makes it so `SymNode` instances never update their expression
- Creates `SymNode.simplified_expr()` method for actually calling `ShapeEnv.replace` on
  its expression. Note that this doesn't update `SymNode.expr`
- Changes how `tensor.size()` gets converted to its Python `torch.Size` type
- Instead of calling `SymInt::maybe_as_int()` method, we create a new
`SymInt::is_symbolic()` method for checking whether it is actually a symbolic value
- This is needed so that when we call `tensor.size()` in the Python side, the returned
sequence is faithful to the actual data, instead of possibly simplifying it and
returning an integer
- 2 files needs this modification:
- _torch/csrc/Size.cpp_: for handling `torch.Tensor.size` Python calls
- _torch/csrc/utils/pybind.cpp_: for handling `symint.cast()` C++ calls
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107492
Approved by: https://github.com/ezyang
ghstack dependencies: #107523
This PR fixes transactional behavior of translation validation insertion.
Previously, this transactional behavior was implemented by removing the FX node if any issue occurred before the end of `evaluate_expr`. However, since we cache FX nodes, we might end up removing something that wasn't inserted in the same function call.
**Solution:** when creating an FX node for `call_function`, we also return whether this is
a fresh FX node or not. Then, we can appropriately handle each case.
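A minimal sketch of that bookkeeping, with illustrative names (not the actual ShapeEnv code):
```python
import torch.fx as fx

def cached_call_function(graph: fx.Graph, cache: dict, target, args):
    key = (target, args)
    if key in cache:
        return cache[key], False   # reused node: must not be erased if validation insertion fails
    node = graph.call_function(target, args)
    cache[key] = node
    return node, True              # fresh node: safe to erase on rollback
```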
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107523
Approved by: https://github.com/ezyang
Added the following APIs:
```
def save(
    ep: ExportedProgram,
    f: Union[str, pathlib.Path, io.BytesIO],
    extra_files: Optional[Dict[str, Any]] = None,
    opset_version: Optional[Dict[str, int]] = None,
) -> None:
    """
    Saves a version of the given exported program for use in a separate process.

    Args:
        ep (ExportedProgram): The exported program to save.
        f (str): A file-like object (has to implement write and flush)
            or a string containing a file name.
        extra_files (Optional[Dict[str, Any]]): Map from filename to contents
            which will be stored as part of f.
        opset_version (Optional[Dict[str, int]]): A map of opset names
            to the version of this opset
    """

def load(
    f: Union[str, pathlib.Path, io.BytesIO],
    extra_files: Optional[Dict[str, Any]] = None,
    expected_opset_version: Optional[Dict[str, int]] = None,
) -> ExportedProgram:
    """
    Loads an ExportedProgram previously saved with torch._export.save

    Args:
        f (str): A file-like object (has to implement write and flush)
            or a string containing a file name.
        extra_files (Optional[Dict[str, Any]]): The extra filenames given in
            this map would be loaded and their content would be stored in the
            provided map.
        expected_opset_version (Optional[Dict[str, int]]): A map of opset names
            to expected opset versions

    Returns:
        An ExportedProgram object
    """
```
Example usage:
```
# With buffer
buffer = io.BytesIO()
torch._export.save(ep, buffer)
buffer.seek(0)
loaded_ep = torch._export.load(buffer)

# With file
with tempfile.NamedTemporaryFile() as f:
    torch._export.save(ep, f)
    f.seek(0)
    loaded_ep = torch._export.load(f)

# With Path
with TemporaryFileName() as fname:
    path = pathlib.Path(fname)
    torch._export.save(ep, path)
    loaded_ep = torch._export.load(path)

# Saving with extra files
buffer = io.BytesIO()
save_extra_files = {"extra.txt": "moo"}
torch._export.save(ep, buffer, save_extra_files)
buffer.seek(0)
load_extra_files = {"extra.txt": ""}
loaded_ep = torch._export.load(buffer, load_extra_files)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107309
Approved by: https://github.com/avikchaudhuri, https://github.com/gmagogsfm, https://github.com/tugsbayasgalan
In this PR, we extend ExportedProgram.module() functionality by also unlifting the mutated buffers. We only really care about top level buffers as we don't allow any buffer mutation inside HigherOrderOps.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107643
Approved by: https://github.com/avikchaudhuri
## Summary
Enables AVX512 dispatch by default for some kernels, for which AVX512 performs better than AVX2.
For other kernels, their AVX2 counterparts are used.
## Implementation details
`REGISTER_DISPATCH` should now only be used for non-AVX512 dispatch.
`ALSO_REGISTER_AVX512_DISPATCH` should be used when AVX512 dispatch should also be done for a kernel.
## Benchmarking results with #104655
[Raw data at GitHub Gist (Click on `Download ZIP`)](https://gist.github.com/sanchitintel/87e07f84774fca8f6b767aeeb08bc0c9)
| Op | Speedup of AVX512 over AVX2 |
|----|------------------------------------|
|sigmoid|~27% with FP32|
|sign| ~16.6%|
|sgn|~15%|
|sqrt|~4%|
|cosh|~37%|
|sinh|~37.5%|
|acos| ~8% with FP32 |
|expm1| ~30% with FP32|
|log|~2%|
|log1p|~16%|
|erfinv|~6% with FP32|
|LogSigmoid|~33% with FP32|
|atan2|~40% with FP32|
|logaddexp| ~24% with FP32|
|logaddexp2| ~21% with FP32|
|hypot| ~24% with FP32|
|igamma|~4% with FP32|
|lgamma| ~40% with FP32|
|igammac|3.5%|
|gelu|~3% with FP32|
|glu|~20% with FP32|
|SiLU|~35% with FP32|
|Softplus|~33% with FP32|
|Mish|~36% with FP32|
|Hardswish|~7% faster with FP32 when tensor can fit in L2 cache|
|Hardshrink|~8% faster with FP32 when tensor can fit in L2 cache|
|Softshrink|~10% faster with FP32 when tensor can fit in L2 cache|
|Hardtanh|~12.5% faster with FP32 when tensor can fit in L2 cache|
|Hardsigmoid|~7% faster with FP32 when tensor can fit in L2 cache|
|hypot|~35%|
|atan2|~37%|
|dequantize per channel|~10%|
## Insights gleaned through collected data (future action-items):
1. Inplace variants of some ops are faster with AVX512 although the functional variant may be slower for FP32. Will enable AVX512 dispatch for the inplace variants of such kernels.
2. Almost all BF16 kernels are faster with AVX512, so after PyTorch 2.1 release, will enable AVX512 dispatch for BF16 kernels whose corresponding FP32 kernel doesn't perform well with AVX512.
3. Some kernels rely on auto-vectorization & might perform better with AVX512 once explicit vectorization would be enabled for them.
Data was collected with 26 physical threads of one socket of Intel Xeon 8371HC. Intel OpenMP & tcmalloc were preloaded.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104165
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/kit1980
Dynamo currently runs the real graph module with real inputs as a way to match the return result of the graph module with the eager return type. This is unsafe when the graph module is side-effectful. In the long term, we will get rid of this step. But in the short term, we just fakify the graph module again and run it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107271
Approved by: https://github.com/ezyang
Issue list:
* Unsupported FX nodes: {'call_function': ['aten.embedding_renorm.default', ~~'aten._embedding_bag_forward_only.default'~~]}.
* aten._embedding_bag.default not captured by test. Hence this test is not reflecting the pattern seen in model from onnxbench. Update: need validation again, unsure if this is still the case.
* `padding_idx` is always emitted for `aten._embedding_bag` and `aten._embedding_bag_forward_only`. This overload is unsupported by Torchlib.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105862
Approved by: https://github.com/justinchuby
CapturedTraceback is fast but one downside is that it has strong references to code objects, which via `co_extra` can cause un-collectable cycles. This means that it is important to clear out CapturedTraceback when you are done with it; e.g., if you collect tracebacks during compilation, you need to explicitly clear them out at the end of compilation to actually make sure they promptly deallocate.
Instead of caching `summary` on the CapturedTraceback, we simply allow for tracebacks to have `tb = None`. Tracebacks get dropped if you pickle the traceback, or if you explicitly call cleanup().
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107471
Approved by: https://github.com/voznesenskym
ghstack dependencies: #107505, #107516, #107530, #107532, #107562
This PR adds a 2d parallel torch.compile test on a simple MLP model and tests that the dynamo changes work. Once @bdhirsh's aot autograd enablement is done, we can switch this test to test the e2e torch.compile workflow.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107473
Approved by: https://github.com/fduwjj
ghstack dependencies: #107472
Starts addressing #106802
This PR also conveniently does some BE:
- Fixes a bug in adamw where we use amsgrad instead of per group amsgrad
- Brings the impls of adamw and adam closer to correctness and to each other
I couldn't fully remove the .pyi's because mypy was going to complain about the entire files which scared me and shouldn't go in this PR anyway.
Test plan:
- Add tests to ensure that lr could be passed as a Tensor
- Did some profiling of the below code (runs 1k iterations of step for Adam)
```
import torch
from torch.testing._internal.common_utils import TestCase

param = torch.rand(2, 3, dtype=torch.float, device='cuda:0', requires_grad=True)
param.grad = torch.rand_like(param)
lr = torch.tensor(.001, device='cuda:0')
opt = torch.optim.Adam([param], lr=lr, fused=True)

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ]
) as p:
    for _ in range(1000):
        opt.step()

print(p.key_averages().table(sort_by="cpu_time_total"))
```
Before my change:
<img width="1381" alt="image" src="https://github.com/pytorch/pytorch/assets/31798555/cfc5175a-0f41-4829-941f-342554f3b152">
After my change (notice there are no d2h syncs and the CPU time is lower!):

Next steps long term:
- have all capturable foreach + forloop impls in Adam(W) handle tensor LR
- have all capturable impls handle tensor LR
- have all impls handle tensor LR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106916
Approved by: https://github.com/albanD
Make it so that scripts can import and run the `emit_metrics` function even if they don't have boto3 installed, in which case it will still validate the inputs but skip the actual metric emission part.
It's purely a refactor without any real logic changes
Motivation: So that run_test.py and the target determination code can use this library easily without worrying about whether it was imported or whether its dependencies are installed.
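A minimal sketch of the optional-boto3 pattern described above; the names and the validation are illustrative:
```python
try:
    import boto3  # noqa: F401
    _HAS_BOTO3 = True
except ImportError:
    _HAS_BOTO3 = False

def emit_metrics(metrics: dict) -> None:
    if not isinstance(metrics, dict):
        raise ValueError("metrics must be a dict")  # validation runs even without boto3
    if not _HAS_BOTO3:
        return  # boto3 missing: skip the actual metric emission
    # ... actual upload via boto3 elided ...
```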
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107070
Approved by: https://github.com/huydhn
Summary:
When AOT Inductor runs a Triton matmul kernel (generated from the Triton mm template) on large inputs of particular shape, the `RuntimeError: CUDA driver error: 1` may happen. E.g., when `x @ y` is compiled with AOT Inductor and run on the input shapes `[10285, 96]` and `[96, 1]`. Digging deeper into the generated AOT Inductor wrapper code, we see this line:
```
launchKernel(triton_unk_fused_mm_0, 81, 1, 1, 4, 55296, kernel_args_var_0, stream);
```
`55296` is the required amount (in bytes) of dynamic shared memory. This is larger than the default dynamic shared memory on A100: `49152` bytes. In these cases, `cudaFuncSetAttribute` must be called explicitly to set the`cudaFuncAttributeMaxDynamicSharedMemorySize` attribute of the kernel before launching it. Or, because AOT Inductor wrapper relies on the CUDA Driver API, the equivalent [`cuFuncSetAttribute`](https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__EXEC.html#group__CUDA__EXEC_1g0e37dce0173bc883aa1e5b14dd747f26) function can be called to set the `CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES` attribute.
This PR adds the above call in the AOT Inductor codegen for every case when the required amount of dynamic SMEM is > 0. The call is done *within* the `launchKernel` function, meaning that it will happen only once per kernel and not affect the subsequent AOT Inductor-compiled model performance (after the first run).
P.S. One could, in principle, call the `cuFuncSetAttribute` only when the required amount of dynamic SMEM is above the default limit, but that would require detecting the default limit which is different on different devices. Assuming that the `cuFuncSetAttribute` is relatively lightweight and because it's performed only once per kernel, for simplicity, the suggestion is to call the function in every non-zero dynamic SMEM case.
Test Plan:
```
$ python test/inductor/test_aot_inductor.py
...
----------------------------------------------------------------------
Ran 5 tests in 100.177s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107601
Approved by: https://github.com/jansel
Summary: this PR detects https://github.com/pytorch/pytorch/issues/107423 and falls back to the non-triton kernel. It also adds a check for non-contiguous issues in uint4x2 in the unit tests, though it's not an issue in this case.
Test Plan:
python pytorch/test/inductor/test_pattern_matcher.py -k "test_mixed_mm_bad_cases"
python pytorch/test/inductor/test_pattern_matcher.py -k "test_uint4x2_mixed_mm"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107495
Approved by: https://github.com/davidberard98
Manually enable `capture_func_transforms` for testing, as the plan is to default `capture_func_transforms` to False in 2.1 (enable it so that we still test the support on the release branch).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107122
Approved by: https://github.com/zou3519
1. Update xfail reasons in fx runtime
2. Enable bloom-560m in runtime test. However, it's blocked by the unsupported constant tensor case. The previous error was because when the model loads with external data, it surpasses 2GB and couldn't be inlined. The fix is to inline the model itself and then replace the original one. Pointing ORT to the path allows it to load the external data into the model at runtime.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107257
Approved by: https://github.com/justinchuby
This PR makes CacheEntry a PyObject. This is a prep PR for cache size changes. As CacheEntry is now a py object, we can traverse the linked list in Python and write cache size policies. It was possible to do in C, but Python is just easier to iterate upon. We call convert_frame only when we (re)compile, so a small bump in latency going from C to Python is acceptable here.
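For example, a cache-size policy can now walk the list with plain Python; a minimal sketch (the `next` attribute name is illustrative):
```python
def count_cache_entries(entry) -> int:
    n = 0
    while entry is not None:
        n += 1
        entry = entry.next
    return n
```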
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107405
Approved by: https://github.com/ezyang
ghstack dependencies: #106917, #107117
Since constrain_as_size has been fixed, I tried serializing it, but ran into some issues.
Notably, after each `.transform` call, I added a helper `_get_updated_range_constraints` to update the range constraints list. This is because when we retrace in a pass, the symbolic values being used change, so we need to update this dictionary.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107386
Approved by: https://github.com/avikchaudhuri, https://github.com/zhxchen17
I pulled a bunch of autograd.Function from test_autograd.py and added a
smoke test for them. Ideally we would actually run test_autograd.py as a
part of the Dynamo test suite, but we have excluded it due to there
being too many errors and I don't have time to figure that out at the
moment.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107467
Approved by: https://github.com/ydwu4
ghstack dependencies: #107459, #107461
If map or autograd.Function have an input that returns a non-Tensor,
then the code just errors out. Instead of erroring out we should graph
break by raising Unsupported so users aren't confused. The better thing
to do is actually support non-Tensor returns but that requires more
work.
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107461
Approved by: https://github.com/ydwu4
ghstack dependencies: #107459
Sometimes the Unsupported error messages can be pretty opaque (see
https://github.com/pytorch/pytorch/issues/106390 for an example). This
PR ensures the error message says something sane by raising a new
Unsupported exception (that includes the older one in the stack trace)
with a description of what's going on.
Test Plan:
- new test utility to check that a dictionary matches a regex so we
don't need to write out this super long error message every time.
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107459
Approved by: https://github.com/ydwu4, https://github.com/kshitij12345
Instead of (poorly) reconstructing the guard list from the guards on OutputGraph, we log them at the horse's mouth: when we actually codegen the guard. This only requires very modest refactoring: as we translate guards into code parts, we also have to pass the source guard along so we can use it to give stack information.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107532
Approved by: https://github.com/anijain2305
ghstack dependencies: #107505, #107516, #107530
All log messages that occur while running Dynamo compilation now have `[X/Y]` added to the beginning of their message. X represents the frame being compiled, while Y says which compilation of the frame. For example, if you are debugging a frame that is repeatedly recompiling, you can look for N/0, N/1, N/2, etc. for the same N. Here is what the logs look like as you transition from one frame to another:
<img width="1372" alt="image" src="https://github.com/pytorch/pytorch/assets/13564/4897e368-1e50-4807-b342-54e911bcf087">
To accurately get this prefix added to all messages, I had to expand the scope of the `tracing` context manager. Its scope now coincides with `log_compilation_event`. To do this, I had to populate fake mode lazily in the TracingContext, since it isn't created until later, inside the OutputGraph.
This subsumes the previous X.Y logging that was solely for dynamic shapes.
Unfortunately I had to reindent some stuff. Review the diff with whitespace off.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107530
Approved by: https://github.com/anijain2305
ghstack dependencies: #107505, #107516
Added a version number to the schema for BC issues. We will add this number to the serialized ExportedProgram and then, when deserializing, if the number does not match up with the existing deserializer, we will error. We should update the number if there are any major changes to the schema.
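A minimal sketch of the deserialization-time check; the constant and field name are illustrative:
```python
SCHEMA_VERSION = 1  # bump on any major schema change

def check_schema_version(serialized: dict) -> None:
    found = serialized.get("schema_version")
    if found != SCHEMA_VERSION:
        raise RuntimeError(
            f"Serialized schema version {found} does not match "
            f"deserializer schema version {SCHEMA_VERSION}"
        )
```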
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107420
Approved by: https://github.com/zhxchen17
I found that the upsample bicubic lowering was generating this line
```python
ops.index_expr(0.244094488188976*x0, torch.float32)
```
which is not good because triton's `ops.index_expr` expects integer expressions and dtypes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105021
Approved by: https://github.com/lezcano
Fixes #104822
A duplicate check is introduced into function `adaptive_max_pool1d`, but this is probably a relatively good approach.
Of course, it is also possible to transparently pass a flag from function `adaptive_max_pool1d` to function `adaptive_max_pool2d` (no need to add a new parameter) and then supplement the relevant checks in `adaptive_max_pool2d`, but this approach is, first, not clear enough, and secondly, the amount of modification is relatively large.
At the same time, there is currently a duplicate check for `output_size`, which is checked in both functions (`adaptive_max_pool1d` && `adaptive_max_pool2d`).
If you have better advice, please let me know, thank you
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107450
Approved by: https://github.com/ezyang
There were many tests whose `_cuda` variants were not running on CUDA. I fixed a few of these, but I'm sure there are plenty more.
It'd be great to have a way to test that we're indeed compiling
something in these tests, but I don't know how to do this off the top of
my head.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107465
Approved by: https://github.com/ezyang
Feature RFC: https://github.com/pytorch/rfcs/pull/56.
The flash attention CPU kernel is added for the FP32 forward path. Blocking is applied on the dimensions of query length and kv length, and the fusion of gemm + softmax update + gemm is done at once for each block. Parallelization is on the dimensions of batch size, head number and query length. In addition, the causal attention mask is supported. As the attention is masked for the unseen tokens, early termination is applied and we only calculate the blocks in the lower triangular part.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103826
Approved by: https://github.com/drisspg, https://github.com/jgong5
ghstack dependencies: #104583, #104584
The new guard printout looks like this:
```
[DEBUG] GUARDS:
[DEBUG] ___check_type_id(L['name'], 7605632) # if name == "special_attr": # test/dynamo/test_misc.py:1155 in __getattribute__
[DEBUG] L['name'] == '_backward_pre_hooks' # if name == "special_attr": # test/dynamo/test_misc.py:1155 in __getattribute__
[DEBUG] ___check_obj_id(L['self'], 139746432564960) # return super().__getattribute__(name) # test/dynamo/test_misc.py:1157 in __getattribute__
[DEBUG] ___check_obj_id(L['__class__'], 1451499216) # return super().__getattribute__(name) # test/dynamo/test_misc.py:1157 in __getattribute__
[DEBUG] ___is_grad_enabled() # _dynamo/output_graph.py:346 in init_ambient_guards
[DEBUG] not ___are_deterministic_algorithms_enabled() # _dynamo/output_graph.py:342 in init_ambient_guards
[DEBUG] ___is_torch_function_enabled() # _dynamo/output_graph.py:350 in init_ambient_guards
[DEBUG] utils_device.CURRENT_DEVICE == None # _dynamo/output_graph.py:348 in init_ambient_guards
```
Along with the guards, we also print what line of user code caused the guard to be added, or what line of Dynamo internal code added the guard (if there is no user stack trace, which is typically the case for ambient guards.)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107505
Approved by: https://github.com/mlazos, https://github.com/voznesenskym, https://github.com/anijain2305
`aot_export` adds metadata for int inputs as symints. This diff turns such metadata into ints since they will be specialized anyway. We don't turn these into runtime assertions yet (but should, as future work).
Differential Revision: D48487562
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107491
Approved by: https://github.com/gmagogsfm
In almost all cases this is only included for writing the output formatter, which only uses `std::ostream`, so including `<ostream>` is sufficient.
The `<istream>` header is ~1000 lines, so the difference is non-trivial.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106914
Approved by: https://github.com/lezcano
This PR allows dynamo to fakify FunctionalTensorWrapper by unwrapping, replacing and wrapping again for FunctionalTensorWrapper so that FunctionalTensorWrapper can be passed in as input for dynamo.optimize and we can support something like this
```python
ff = torch.func.functionalize(f)
torch.compile(ff)(x)
```
This PR doesn't follow the `__tensor_flatten__` and `__tensor_unflatten__` protocol right now because we're not sure of the plan for doing that for FunctionalTensorWrapper (it's implemented in C++).
**Test Plan:**
Add a new test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107062
Approved by: https://github.com/zou3519
ghstack dependencies: #107042
```python
import torch

def wrapper_fn(x):
    with torch.autograd.graph.disable_saved_tensors_hooks("ERROR"):
        y = x + 1
        print("HI")
        return y + 2

x = torch.randn(())
a = wrapper_fn(x)
opt = torch.compile(wrapper_fn, backend='eager', fullgraph=False)
e = opt(x)
```
Without the fix fails with,
```
Traceback (most recent call last):
File "/home/kshiteej/Pytorch/pytorch_functorch/test/test_trace_grad.py", line 182, in <module>
e = opt(x)
File "/home/kshiteej/Pytorch/pytorch_functorch/torch/_dynamo/eval_frame.py", line 333, in _fn
return fn(*args, **kwargs)
File "/home/kshiteej/Pytorch/pytorch_functorch/test/test_trace_grad.py", line 165, in wrapper_fn
def wrapper_fn(x):
AttributeError: module 'torch.autograd.graph' has no attribute 'disable_saved_tensors_hook'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106875
Approved by: https://github.com/zou3519
Summary:
Add support for broadcast and scatter in FakeProcessGroup.
As a side note, we can't easily support broadcast_object_list or
scatter_object_list since they rely on actual broadcasted/scattered
values for pickle object deserialization. We could add support for rank 0, but
other to support ranks may need additional changes outside of
FakeProcessGroup.
Test Plan:
`buck2 run mode/dev-nosan -c fbcode.enable_gpu_sections=true
//caffe2/test/distributed:fake_pg`, on of TARGETS diff: D48481513
`python test/distributed/test_fake_pg.py` after github sync
Differential Revision: D48481512
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107480
Approved by: https://github.com/wanchaol
Summary:
Currently in quantizer/quantize_pt2e we import things from specific quantizers (XNNPACKQuantizer, QuantizationConfig) etc.
This PR removes them so it's clearer that they are not part of the core quantization code base.
This PR also removed get_supported_operators from main Quantizer since we haven't seen a clear need for this API
Test Plan:
CIs
Imported from OSS
Differential Revision: D48340367
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107259
Approved by: https://github.com/kimishpatel
The error fixed here happened when we had multiple autograd::Edge objects pointing to the same autograd::Node, causing before() to get called multiple times on the same object.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105887
Approved by: https://github.com/albanD
It looks like this:
```
[DEBUG] GUARD: ___check_type_id(L['z'][L["MyEnum"].BAR], 7640416) and L['z'][L["MyEnum"].BAR] == 10
[DEBUG] Stack:
[DEBUG] File "/data/users/ezyang/b/pytorch/test/dynamo/test_misc.py", line 6657, in <module>
[DEBUG] run_tests()
[DEBUG] File "/data/users/ezyang/b/pytorch/torch/_dynamo/test_case.py", line 38, in run_tests
[DEBUG] run_tests()
[DEBUG] File "/data/users/ezyang/b/pytorch/torch/testing/_internal/common_utils.py", line 985, in run_tests
[DEBUG] unittest.main(argv=argv)
[DEBUG] File "/home/ezyang/local/b/pytorch-env/lib/python3.10/unittest/main.py", line 101, in __init__
[DEBUG] self.runTests()
[DEBUG] File "/home/ezyang/local/b/pytorch-env/lib/python3.10/unittest/main.py", line 271, in runTests
[DEBUG] self.result = testRunner.run(self.test)
[DEBUG] File "/home/ezyang/local/b/pytorch-env/lib/python3.10/unittest/runner.py", line 184, in run
[DEBUG] test(result)
[DEBUG] File "/home/ezyang/local/b/pytorch-env/lib/python3.10/unittest/suite.py", line 84, in __call__
[DEBUG] return self.run(*args, **kwds)
[DEBUG] File "/home/ezyang/local/b/pytorch-env/lib/python3.10/unittest/suite.py", line 122, in run
[DEBUG] test(result)
[DEBUG] File "/home/ezyang/local/b/pytorch-env/lib/python3.10/unittest/suite.py", line 84, in __call__
[DEBUG] return self.run(*args, **kwds)
[DEBUG] File "/home/ezyang/local/b/pytorch-env/lib/python3.10/unittest/suite.py", line 122, in run
[DEBUG] test(result)
[DEBUG] File "/home/ezyang/local/b/pytorch-env/lib/python3.10/unittest/case.py", line 650, in __call__
[DEBUG] return self.run(*args, **kwds)
[DEBUG] File "/data/users/ezyang/b/pytorch/torch/testing/_internal/common_utils.py", line 2521, in run
[DEBUG] self._run_with_retry(
[DEBUG] File "/data/users/ezyang/b/pytorch/torch/testing/_internal/common_utils.py", line 2450, in _run_with_retry
[DEBUG] super_run(result=result)
[DEBUG] File "/home/ezyang/local/b/pytorch-env/lib/python3.10/unittest/case.py", line 591, in run
[DEBUG] self._callTestMethod(testMethod)
[DEBUG] File "/home/ezyang/local/b/pytorch-env/lib/python3.10/unittest/case.py", line 549, in _callTestMethod
[DEBUG] method()
[DEBUG] File "/data/users/ezyang/b/pytorch/torch/testing/_internal/common_utils.py", line 2377, in wrapper
[DEBUG] method(*args, **kwargs)
[DEBUG] File "/data/users/ezyang/b/pytorch/test/dynamo/test_misc.py", line 2529, in test_enum_as_dict_key_with_overloaded_str
[DEBUG] res = opt_fn(x)
[DEBUG] File "/data/users/ezyang/b/pytorch/torch/_dynamo/eval_frame.py", line 333, in _fn
[DEBUG] return fn(*args, **kwargs)
[DEBUG] File "/data/users/ezyang/b/pytorch/test/dynamo/test_misc.py", line 2519, in fn
[DEBUG] torch._dynamo.graph_break()
[DEBUG] File "/data/users/ezyang/b/pytorch/torch/_dynamo/eval_frame.py", line 493, in catch_errors
[DEBUG] return callback(frame, cache_size, hooks, frame_state)
[DEBUG] File "/data/users/ezyang/b/pytorch/torch/_dynamo/convert_frame.py", line 637, in _convert_frame
[DEBUG] result = inner_convert(frame, cache_size, hooks, frame_state)
[DEBUG] File "/data/users/ezyang/b/pytorch/torch/_dynamo/convert_frame.py", line 133, in _fn
[DEBUG] return fn(*args, **kwargs)
[DEBUG] File "/data/users/ezyang/b/pytorch/torch/_dynamo/convert_frame.py", line 371, in _convert_frame_assert
[DEBUG] return _compile(
[DEBUG] File "/data/users/ezyang/b/pytorch/torch/_dynamo/convert_frame.py", line 567, in _compile
[DEBUG] guarded_code = compile_inner(code, one_graph, hooks, transform)
[DEBUG] File "/data/users/ezyang/b/pytorch/torch/_dynamo/utils.py", line 181, in time_wrapper
[DEBUG] r = func(*args, **kwargs)
[DEBUG] File "/data/users/ezyang/b/pytorch/torch/_dynamo/convert_frame.py", line 466, in compile_inner
[DEBUG] out_code = transform_code_object(code, transform)
[DEBUG] File "/data/users/ezyang/b/pytorch/torch/_dynamo/bytecode_transformation.py", line 1028, in transform_code_object
[DEBUG] transformations(instructions, code_options)
[DEBUG] File "/data/users/ezyang/b/pytorch/torch/_dynamo/convert_frame.py", line 416, in transform
[DEBUG] tracer = InstructionTranslator(
[DEBUG] File "/data/users/ezyang/b/pytorch/torch/_dynamo/symbolic_convert.py", line 2018, in __init__
[DEBUG] self.symbolic_locals = collections.OrderedDict(
[DEBUG] File "/data/users/ezyang/b/pytorch/torch/_dynamo/symbolic_convert.py", line 2021, in <genexpr>
[DEBUG] VariableBuilder(
[DEBUG] File "/data/users/ezyang/b/pytorch/torch/_dynamo/variables/builder.py", line 211, in __call__
[DEBUG] vt = self._wrap(value).clone(**self.options())
[DEBUG] File "/data/users/ezyang/b/pytorch/torch/_dynamo/variables/builder.py", line 404, in _wrap
[DEBUG] result = {
[DEBUG] File "/data/users/ezyang/b/pytorch/torch/_dynamo/variables/builder.py", line 405, in <dictcomp>
[DEBUG] k: VariableBuilder(
[DEBUG] File "/data/users/ezyang/b/pytorch/torch/_dynamo/variables/builder.py", line 211, in __call__
[DEBUG] vt = self._wrap(value).clone(**self.options())
[DEBUG] File "/data/users/ezyang/b/pytorch/torch/_dynamo/variables/builder.py", line 354, in _wrap
[DEBUG] return type_dispatch(self, value)
[DEBUG] File "/data/users/ezyang/b/pytorch/torch/_dynamo/variables/builder.py", line 837, in wrap_literal
[DEBUG] return self.wrap_unspecialized_primitive(value)
[DEBUG] File "/data/users/ezyang/b/pytorch/torch/_dynamo/variables/builder.py", line 1073, in wrap_unspecialized_primitive
[DEBUG] guards=self.make_guards(GuardBuilder.CONSTANT_MATCH),
[DEBUG] File "/data/users/ezyang/b/pytorch/torch/_dynamo/variables/builder.py", line 269, in make_guards
[DEBUG] return {source.make_guard(guard) for guard in guards}
[DEBUG] File "/data/users/ezyang/b/pytorch/torch/_dynamo/variables/builder.py", line 269, in <setcomp>
[DEBUG] return {source.make_guard(guard) for guard in guards}
[DEBUG] File "/data/users/ezyang/b/pytorch/torch/_guards.py", line 641, in make_guard
[DEBUG] return Guard(self.name(), self.guard_sou
```
One downside is I can't report *why* the guard was added. I'm not entirely sure how to do this; the problem is guards will propagate to a bunch of variables before finally getting included as part of the final set. Maybe a very very verbose version could report stack traces at every handoff point.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107388
Approved by: https://github.com/mlazos
ghstack dependencies: #107438, #107358
This adds some utilities for conveniently working with fast combined CapturedTraceback from Python. The main goal of these utilities is to make it easier for people to use CapturedTraceback as a drop-in replacement for `traceback.extract_stack`, which is 20x slower than CapturedTraceback.
I port symbolic shapes to use the new CapturedTraceback code, to validate that the APIs work and are useful.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107358
Approved by: https://github.com/zdevito, https://github.com/albanD
ghstack dependencies: #107438
I do this instead of pybind11 because I need a custom tp_dealloc to promptly free PyFrames. I also add GC traverse/clear support. This is required to avoid leaking memory from co_extra on code objects in some obscure situations. This is indirectly tested by #107388
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107438
Approved by: https://github.com/albanD
This PR is the first change of a series of refactors to the op dispatch logic to:
1. remove the redundant logic in the op dispatch, simplify the error
checking
2. reduce the number of tree_map/tree_flatten/unflatten needed to reduce
the overhead coming from those operations
3. remove the CachedShardingPropagator by using lru_cache from functools
directly, this makes it not only helps TP, but general DTensor
operations could be faster!
4. change the view ops behavior by inplace changing the op_schema, which
is dangerous for sharding prop caching, model the view op as one type
of resharding too
5. enrich output sharding to include whether the op needs redistribute
so that we don't need explicit op schema comparison to know it.
This should help with further reducing the CPU overhead, benchmark
results:
before (without this change), aten.addmm latency: 0.476ms

after (with this change), aten.addmm latency: 0.341ms

overall one layer of mlp time reduced from 13.535 -> 9.665ms
Apart from overhead reduction, this PR simplifies the op dispatching logic and the resharding logic (more refactor needed to make things more clean, which will be done in later PRs)
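As a sketch of item 3 above, the sharding propagation cache can simply be `functools.lru_cache` over a hashable op schema; the function below is illustrative, with a placeholder body:
```python
from functools import lru_cache

@lru_cache(maxsize=None)   # replaces the hand-rolled CachedShardingPropagator
def propagate_op_sharding(op_name: str, input_specs: tuple) -> tuple:
    # arguments must be hashable for lru_cache; the real logic computes output sharding here
    return input_specs
```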
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107305
Approved by: https://github.com/fduwjj
There was an issue where `hasattr(dep, "index")` would incorrectly be True because it was picking up `NamedTuple.index` (a method). We were also comparing that method to a `sympy.Expr` in one place.
As far as I can tell this wasn't actually causing any bugs (the comparison actually did the right thing), but still good to fix it.
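A minimal standalone illustration of the pitfall:
```python
from collections import namedtuple

Dep = namedtuple("Dep", ["name"])
print(hasattr(Dep("buf0"), "index"))   # True: this is tuple.index, a bound method, not a data field
```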
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107092
Approved by: https://github.com/eellison
Fix cpp wrapper failure on `clip` in Torchbench:
```
RuntimeError: tensor does not have a device
```
An `optional<at::Tensor>` variable with a value equal to `at::Tensor()` will be considered as _containing a value_: when it's converted to `bool`, it returns `true`, while `None` in Python converts to `false`.
Fix it to be an optional variable that _does not contain a value_.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106847
Approved by: https://github.com/jgong5, https://github.com/jansel
This one is a wrapper upon `mkl_gemm_bf16bf16f32`, which is used in the flash attention kernel on Intel 4th gen Xeon.
A fallback path has also been implemented on cpublas::gemm in case `mkl_gemm_bf16bf16f32` is not available.
The primary target of this change is to help build kernels in `scaled_dot_product_attention`, e.g. flash attention and efficient attention. In the attention kernel, `q @ k.T = attn`, q and k will be given as bfloat16 and attn is float32. This is beneficial for both performance and accuracy, since attn will be used to compute lazy softmax which has to be done in float32.
This patch also adds the routine `sbgemm_` from OpenBLAS, which also has a signature of bf16 * bf16 -> fp32; but since the OpenBLAS routine has a different name from MKL's, we cannot use `sbgemm_` with MKL.
In the fallback path, it takes two steps to do the computation: first do gemm with beta = 0; then add beta * C in full precision. Idea from @peterbell10 not to truncate C to bfloat16, so as to avoid unnecessary accuracy loss.
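A minimal sketch of that two-step fallback in tensor terms (illustrative, not the C++ cpublas code):
```python
import torch

def gemm_bf16bf16f32_fallback(a_bf16, b_bf16, c_fp32, alpha=1.0, beta=1.0):
    out = alpha * (a_bf16.float() @ b_bf16.float())   # step 1: the gemm itself, with beta = 0
    return out + beta * c_fp32                        # step 2: add beta * C in full fp32 precision
```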
ref: https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-c/2023-0/cblas-gemm-bf16bf16f32.html
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107196
Approved by: https://github.com/jgong5, https://github.com/peterbell10
Alternative to https://github.com/pytorch/pytorch/pull/107034, implements @ezyang 's suggestion from https://github.com/pytorch/pytorch/pull/107034#discussion_r1292857201.
This PR addresses https://fb.workplace.com/groups/pytorch.oss.dev/posts/1699944830430051 and does a bunch of stacked changes:
- Make the `Generator` class support GC; this makes all `Generator` instances tracked and accessible through Python's GC.
- Use the GC to retrieve all existing Generator instances in Dataloader's `_worker_loop` and re-seed them: this extends what is already applied to the global/default Generator, which is already re-seeded.
~TODO: a bit of docs and justification, which I'll do if this PR is mergeable.~ -- Done
CC @albanD @ezyang as previously discussed
BC-Breaking Note
-------------------
We now re-seed all `Generator` instances within the `Dataloader` workers' loop to ensure that their RNG is different across workers.
Previously, the RNG of user-defined `Generators` would be the same across workers, which could lead to wrong training procedures. This only affects user-defined `Generators`, not the default `Generator` (which was already re-seeded).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107131
Approved by: https://github.com/ezyang
This replaces `var_unnormalized` reduction type with `welford_reduce` which takes the input data and outputs not just the variance, but also the mean and weights which account for the full welford accumulator state. Thus we can avoid re-computing the mean, and we now have enough information to create a multilayer reduction which I implement here by adding a second reduction type called `welford_combine` which reduces over all three inputs simultaneously.
Multi-layer support is particularly important as normalization operators like BatchNorm are being split in many timm models, which meant `var_unnormalized` had to fall back to two-pass variance calculation.
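For reference, a minimal sketch of what the `welford_combine` step computes when merging two accumulator states (not the Inductor codegen itself; weights are assumed positive):
```python
def welford_combine(mean_a, m2_a, w_a, mean_b, m2_b, w_b):
    w = w_a + w_b
    delta = mean_b - mean_a
    mean = mean_a + delta * (w_b / w)
    m2 = m2_a + m2_b + delta * delta * (w_a * w_b / w)
    return mean, m2, w   # variance = m2 / w, and the mean comes along for free
```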
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104725
Approved by: https://github.com/lezcano
Repeats #106429 for scatter_reduce so that the backward will pass for PT2. The .item() call is only needed to make double-backward work, which isn't supported anyway for PT2; so an easy fix is to just skip the .item() call if we know we won't need double-backward.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107353
Approved by: https://github.com/eellison
There are extra graph compilations on XLA when beta{1,2} ** step gets too small. This PR addresses this issue by enabling the `capturable` interface for XLA, as well as switching to `torch.float_power`, which preserves the same behaviour as the non-capturable flow on XLA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102858
Approved by: https://github.com/janeyx99, https://github.com/albanD
The documentation for `torch.set_float32_matmul_precision()` mentions a datatype called "bfloat16_3x". This doesn't appear to be a very standard term, and I had a hard time figuring out what exactly it meant. I now assume it refers to [[Henry2019]](http://arxiv.org/abs/1904.06376), which describes an algorithm by which a float32 multiplication is approximated via three bfloat16 multiplications. This PR updates the documentation to include this reference and to briefly describe how this algorithm works.
Note that I just learned everything that I wrote here, so I'd appreciate if someone more expert in this topic could check to make sure that I didn't get anything significantly wrong.
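For reference, a rough emulation of the 3x-bfloat16 idea (illustrative only; this is not how cuBLAS implements it):
```python
import torch

def matmul_bf16_3x(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    a_hi = a.to(torch.bfloat16)
    a_lo = (a - a_hi.float()).to(torch.bfloat16)   # the remainder that bfloat16 dropped
    b_hi = b.to(torch.bfloat16)
    b_lo = (b - b_hi.float()).to(torch.bfloat16)
    # a @ b ~= a_hi@b_hi + a_hi@b_lo + a_lo@b_hi   (the tiny a_lo@b_lo term is dropped)
    return (a_hi.float() @ b_hi.float()
            + a_hi.float() @ b_lo.float()
            + a_lo.float() @ b_hi.float())
```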
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107169
Approved by: https://github.com/colesbury
This commit fixes a memory leak caused by creating a new PyListObject using PyDict_Items() and not releasing that list later. This often prevented the entire model from being de-allocated even when all python references to it have gone out of scope.
Here is a repro script:
```python
import psutil, torch, transformers, gc, os, sys
import math

# Size in MB
model_size = 512
kB = 1024
MB = kB * kB
precision_size = 4  # bytes per float
activation_size = math.floor(math.sqrt(model_size * MB / precision_size))

class Net(torch.nn.Module):
    def __init__(self, activation_size):
        super(Net, self).__init__()
        self.linear = torch.nn.Linear(activation_size, activation_size)

    def forward(self, x):
        return {"result": self.linear(x)}

def collect_and_report(s):
    gc.collect()
    print(s)
    #print("psutil: ", psutil.virtual_memory().percent)
    print("CPU MB used by this process: ", psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2)
    print("GPU MB allocated by pytorch: ", torch.cuda.memory_allocated(0) / 1024 ** 2)
    print()

def run_test(device_str):
    device = torch.device(device_str)
    dummy_input = torch.zeros(activation_size, requires_grad=True).to(device)
    collect_and_report("Before loading model: ")
    model = Net(activation_size).to(device)
    collect_and_report("After loading model: ")
    torch.onnx.export(model, dummy_input, "dummy.onnx")
    collect_and_report("After exporting model: ")
    del model
    collect_and_report("After deleting model:")

print("Running CPU test: ")
run_test("cpu")
print("Running GPU test: ")
run_test("cuda")
```
Results without this commit:
```
Running CPU test:
Before loading model:
CPU MB used by this process: 346.5
GPU MB allocated by pytorch: 0.0
After loading model:
CPU MB used by this process: 861.078125
GPU MB allocated by pytorch: 0.0
After exporting model:
CPU MB used by this process: 880.12890625
GPU MB allocated by pytorch: 0.0
After deleting model:
CPU MB used by this process: 880.12890625
GPU MB allocated by pytorch: 0.0
Running GPU test:
Before loading model:
CPU MB used by this process: 991.9375
GPU MB allocated by pytorch: 0.04443359375
After loading model:
CPU MB used by this process: 992.19140625
GPU MB allocated by pytorch: 512.0888671875
After exporting model:
CPU MB used by this process: 1026.64453125
GPU MB allocated by pytorch: 520.25830078125
After deleting model:
CPU MB used by this process: 1026.64453125
GPU MB allocated by pytorch: 520.25830078125
```
With this commit:
```
Running CPU test:
Before loading model:
CPU MB used by this process: 372.7734375
GPU MB allocated by pytorch: 0.0
After loading model:
CPU MB used by this process: 887.18359375
GPU MB allocated by pytorch: 0.0
After exporting model:
CPU MB used by this process: 918.96875
GPU MB allocated by pytorch: 0.0
After deleting model:
CPU MB used by this process: 407.3671875
GPU MB allocated by pytorch: 0.0
Running GPU test:
Before loading model:
CPU MB used by this process: 516.6875
GPU MB allocated by pytorch: 0.04443359375
After loading model:
CPU MB used by this process: 516.75390625
GPU MB allocated by pytorch: 512.0888671875
After exporting model:
CPU MB used by this process: 554.25390625
GPU MB allocated by pytorch: 520.2138671875
After deleting model:
CPU MB used by this process: 554.25390625
GPU MB allocated by pytorch: 8.16943359375
```
Fixes #106976
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107244
Approved by: https://github.com/BowenBao, https://github.com/kit1980
Summary:
Seems like a bug in D47998435, where, when the cache hits, it returns None.
Repro:
```
import torch

class TestModule(torch.nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return x + 1

mod = TestModule()
inp = torch.rand(1)
out = mod(inp)
mod2 = torch.fx.symbolic_trace(mod, concrete_args=[inp])
so, _ = torch._export.aot_compile(mod2, tuple([inp]))
# 2nd time, it will return None
so, _ = torch._export.aot_compile(mod2, tuple([inp]))
assert so is not None  # FAIL
```
Test Plan: Run the repro
Differential Revision: D48258375
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107020
Approved by: https://github.com/angelayi
Summary:
Based on D48377631 with updates to guard the utilization of cublas features only found after 11.8
According to https://docs.nvidia.com/cuda/cublas/#id99 only FP8 matrix types can be scaled, and `Float8_e4m3`x`Float8_e4m3` results can be returned as `Float8_e4m3` type, or upcast to `Half`, `BFloat16` or `Float`, but in that case `result_scale` will have no effect as well as `amax` would not be computed.
Optional `bias` argument can also be passed to a function, which should be a vector of either `Half` or `BFloat16`, whose values are added to each row of the result matrix.
See table below for supported input and output types:
| Mat1 type | Mat2 type | Bias type | Output types |
| ----------- | ----------- | ----------- | ----------- |
| Float8_e4m3 | Float8_e4m3 | Float16 | Float8_e4m3, Float16 |
| Float8_e4m3 | Float8_e4m3 | BFloat16 | Float8_e4m3, BFloat16, Float |
| Float8_e5m2 | Float8_e4m3 | Float16 | Float8_e4m3, Float8_e5m2, Float16 |
| Float8_e5m2 | Float8_e4m3 | BFloat16 | Float8_e4m3, Float8_e5m2, BFloat16, Float |
| Float8_e4m3 | Float8_e5m2 | Float16 | Float8_e4m3, Float8_e5m2, Float16 |
| Float8_e4m3 | Float8_e5m2 | BFloat16 | Float8_e4m3, Float8_e5m2, BFloat16, Float |
| Float8_e4m3 | Float8_e5m2 | Not supported | Not supported |
Skip the decomposition implementation until the fp8-on-triton story is better defined. A potential decomposition could look something like the following:
```python
@register_decomposition(aten._scaled_mm)
def _scaled_mm(
    mat1: Tensor,
    mat2: Tensor,
    *,
    dtype: Optional[torch.dtype] = None,
    scale_a: Optional[Tensor] = None,
    scale_b: Optional[Tensor] = None,
    scale_result: Optional[Tensor] = None,
) -> Tuple[Tensor, Tensor]:
    rc = torch.mm(mat1.to(torch.float32), mat2.to(torch.float32))
    rc = scale_a * rc if scale_a is not None else rc
    rc = scale_b * rc if scale_b is not None else rc
    rc = scale_result * rc if scale_result is not None else rc
    rc = rc.to(dtype if dtype is not None else mat1.dtype)
    return rc, torch.tensor(0.0, device=mat1.device)
```
Known limitations:
- Only works for matrix sizes divisible by 16
- 1st operand must be in row-major and 2nd in column-major order (i.e. if `x` and `y` are contiguous, then only `torch._scaled_mm(x, y.t())` will work)
Test Plan: Tests in test_matmul_cuda.py
Differential Revision: D48415871
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107341
Approved by: https://github.com/vkuzo
Summary:
We change `generate_opcheck_tests` to be a bit more user-friendly. Note that
there are some internal-only changes; please review them there.
Test Plan: - tests
Differential Revision: D47965247
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107328
Approved by: https://github.com/ezyang
We cannot use inner tensors for finalizers, as they are not collectable until waited on.
This PR adds a bunch of tests for the observable behavior we want, including the
necessary scaffolding to test whether code waits on its collectives properly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107250
Approved by: https://github.com/wconstab
Working as starter task with @Chillee
This PR adds a method under BaseSchedulerNode to estimate the node's runtime in seconds.
We use a heuristic-based approach, first considering whether the operation is memory-bandwidth bound or compute bound:
- memory-bandwidth bound: we compute the number of bytes that are read from/written to memory
- compute bound: we compute the FLOPS required by the operation
One use case is as a cost model for scheduling: https://github.com/pytorch/pytorch/pull/100762
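A hedged, illustrative sketch of that heuristic (the device constants and function below are made up for illustration, not the actual BaseSchedulerNode API):
```python
# Assumed, illustrative device characteristics - not real inductor constants.
PEAK_BYTES_PER_S = 2.0e12   # memory bandwidth
PEAK_FLOPS_PER_S = 300e12   # compute throughput

def estimate_runtime_s(bytes_read_written: int, flops: int) -> float:
    """Rough cost model: a node is bound by whichever resource dominates."""
    memory_time = bytes_read_written / PEAK_BYTES_PER_S
    compute_time = flops / PEAK_FLOPS_PER_S
    return max(memory_time, compute_time)

# e.g. a pointwise add over two fp32 tensors with 10**6 elements each:
n = 10**6
print(estimate_runtime_s(bytes_read_written=3 * 4 * n, flops=n))
```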
```
(pytorch-3.10) [14:08:02] ~/local/pytorch (xmfan/estimate_snode_runtime) > python3 test/inductor/test_perf.py -k EstimateSnodeRuntimeTests
[(ExternKernelSchedulerNode(name='buf0'), 400)]
[(ExternKernelSchedulerNode(name='buf0'), 2.35057908433887e-27)]
.[(ExternKernelSchedulerNode(name='buf0'), 3000), (SchedulerNode(name='buf1'), 3000)]
[(ExternKernelSchedulerNode(name='buf0'), 2.35057908433887e-26), (SchedulerNode(name='buf1'), 7.187055238190188e-09)]
.[(ExternKernelSchedulerNode(name='buf0'), 3000)]
[(ExternKernelSchedulerNode(name='buf0'), 2.35057908433887e-26)]
.[(ExternKernelSchedulerNode(name='buf0'), 34600)]
[(ExternKernelSchedulerNode(name='buf0'), 3.22687496698039e-24)]
.[(ExternKernelSchedulerNode(name='buf0'), 396)]
[(ExternKernelSchedulerNode(name='buf0'), 1.88046326747109e-27)]
.[(ExternKernelSchedulerNode(name='buf0'), 396)]
[(ExternKernelSchedulerNode(name='buf0'), 1.88046326747109e-27)]
.[(ExternKernelSchedulerNode(name='buf0'), 7776176)]
[(ExternKernelSchedulerNode(name='buf0'), 4.63240241413653e-21)]
.[(FusedSchedulerNode(nodes=buf0_buf1), 210)]
[(FusedSchedulerNode(nodes=buf0_buf1), 5.030938666733132e-10)]
.[(ExternKernelSchedulerNode(name='buf0'), 300)]
[(ExternKernelSchedulerNode(name='buf0'), 2.35057908433887e-27)]
.[(SchedulerNode(name='buf0'), 20)]
[(SchedulerNode(name='buf0'), 4.7913701587934585e-11)]
.
----------------------------------------------------------------------
Ran 10 tests in 14.311s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106426
Approved by: https://github.com/Chillee
Migrate from distutils Version classes to packaging.version due to the deprecation warning.
```python
/root/Git.d/pytorch/pytorch/torch/testing/_internal/common_methods_invocations.py:17136: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
active_if=TEST_SCIPY and LooseVersion(scipy.__version__) < "1.4.0"),
/root/Git.d/pytorch/pytorch/torch/testing/_internal/common_methods_invocations.py:17138: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
active_if=TEST_SCIPY and LooseVersion(scipy.__version__) < "1.4.0"),
/root/Git.d/pytorch/pytorch/torch/testing/_internal/common_methods_invocations.py:17140: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
active_if=TEST_SCIPY and LooseVersion(scipy.__version__) < "1.4.0"),
```
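For reference, a minimal sketch of the replacement pattern (the version string below is a placeholder for `scipy.__version__`):
```python
from packaging.version import Version

scipy_version = "1.3.1"  # placeholder for scipy.__version__
print(Version(scipy_version) < Version("1.4.0"))  # True, same check as in the warnings above
```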
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107207
Approved by: https://github.com/soulitzer
- The text says `Next, let’s try a real model like resnet50 from the PyTorch` but the code example uses `resnet18`. Fixed the code to use `resnet50` for consistency.
- One of the examples in the TorchDynamo Overview used an uncompiled model - fixed it to use the compiled model.
- Removed an unused import of `_dynamo` in one of the examples
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107267
Approved by: https://github.com/soulitzer
The docs correctly (i.e matching actual op behavior) state that
`right = False` means `boundaries[i-1] < input[m][n]...[l][x] <= boundaries[i]`.
However they previously stated that
`If 'right' is False (default), then the left boundary is closed.`
which contradicts the `boundaries[i-1] < input[m][n]...[l][x] <= boundaries[i]` statement.
This modifies the docs to say `... then the left boundary is OPEN.` and also clarifies that this is the opposite behavior of numpy.digitize.
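A small illustration of the clarified semantics (values chosen for clarity; treat as a sketch rather than doc output):
```python
import torch

boundaries = torch.tensor([1, 3, 5])
values = torch.tensor([1, 3, 5])

# right=False (default): boundaries[i-1] < v <= boundaries[i], left boundary open
print(torch.bucketize(values, boundaries))              # tensor([0, 1, 2])
# right=True: boundaries[i-1] <= v < boundaries[i]
print(torch.bucketize(values, boundaries, right=True))  # tensor([1, 2, 3])
# numpy.digitize(values, boundaries) gives [1, 2, 3] here - the opposite convention
```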
Fixes #91580
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104474
Approved by: https://github.com/aakhundov, https://github.com/svekars
Summary:
In fbcode, aten and jit ops can get registered in different orders depending on build mode. In dev mode, aten is registered first; in opt mode, jit is registered first.
This causes problems in torch.ops.aten.* calls; these calls use `torch._C._jit_get_operation`, which selects an overload based on the inputs to the call. It searches through the overloads for the op with the given name, and chooses the first one that matches the input types. "First" depends on whether aten or jit ops were registered first - e.g. in `test_both_scalars_cuda` in opt mode, it chooses `add.complex` and returns a complex value.
We also saw this issue in https://github.com/pytorch/pytorch/pull/103576.
This PR sorts the list of overloads, putting the aten ops first.
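A hedged sketch of the idea (names are made up; only the ordering matters):
```python
# Sort overloads so aten-registered ones come first; sorted() is stable,
# so the relative order within each group is preserved.
def stable_overload_order(overloads, registered_by):
    return sorted(overloads, key=lambda op: registered_by(op) != "aten")

print(stable_overload_order(
    ["add.complex", "add.Tensor"],
    {"add.complex": "jit", "add.Tensor": "aten"}.get,
))  # ['add.Tensor', 'add.complex']
```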
Differential Revision: D48304930
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107138
Approved by: https://github.com/ezyang, https://github.com/eellison
Summary:
Previously if we have:
```
conv1 -> cat
conv2 /
```
and configure the outputs of conv1/conv2 to be int8 quantized, and cat to also be int8 quantized with shared inputs,
it will not produce the expected results (the inputs of cat will not be shared).
The problem is that some checks were missing when inserting observers for the inputs of cat.
This PR fixes the problem.
Fixes: https://github.com/pytorch/pytorch/issues/106760
Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_shared_qspec
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106922
Approved by: https://github.com/kimishpatel
Summary:
Some jobs in the next diff in stack (D48229150) fail with the following message:
```
stderr: In file included from xplat/caffe2/c10/cuda/CUDACachingAllocator.cpp:9:
xplat/caffe2/c10/util/static_tracepoint.h:4:6: error: 'TORCH_DISABLE_SDT' is not defined, evaluates to 0 [-Werror,-Wundef]
!TORCH_DISABLE_SDT
```
When porting USDT macros to PyTorch in D47159249, I must not have hit a codepath that treated warnings as errors during testing.
This diff fixes the issue by first checking whether the `TORCH_DISABLE_SDT` macro is defined before trying to access it in the `static_tracepoint.h` header.
Test Plan:
Similar to D47159249, tested the following macro on test scripts with `libbpf` USDTs:
* `CAFFE_DISABLE_SDT`
Reviewed By: chaekit
Differential Revision: D48251736
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107149
Approved by: https://github.com/chaekit
Companion with https://github.com/pytorch/test-infra/pull/4424
Uses the file ratings generated by the test-infra PR to reorder tests. For each test file, sum the file ratings from the changed files in the PR, and order the tests by that sum.
A lot of tests are probably going to end up as "prioritized" since it takes anything with a rating > 0 right now.
Sharding is done twice, once on the prioritized tests, and once on the general/non-prioritized tests. Prioritized tests have an order, so they should be sharded according to that order, while general tests don't have an order and are sharded by test time, which should result in more balanced shards.
I'll change the metric name before I merge; I want to quarantine my testing stuff from actual results.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106347
Approved by: https://github.com/ZainRizvi
I get a 2% inference speedup in HF with this PR. I checked to see if there are any models where unfusing was slower than the cublas gelu fusion, and I did not see any, which was surprising to me. Sorry for the cublas-activation api churn 😬
Kicking off another run with cublas 12; it's possible that the results have changed since.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106912
Approved by: https://github.com/jansel
ghstack dependencies: #106911
We avoid calling the user's function f again in export. It's error-prone (due to side effects in f) and time-consuming. Instead, we directly manipulate the out_spec of the graph module to make sure the graph module outputs a tuple so that aot_export is happy.
The out_spec of gm_torch_level is computed from the dynamo-traced result and is guaranteed to produce the same output as eagerly running the user's original callable f.
Test Plan:
existing tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107249
Approved by: https://github.com/tugsbayasgalan
This moves the `overloaded_args` field from FunctionSignature to PythonArgs. FunctionSignature is shared by all calls and should be immutable. PythonArgs contains the parsing results for a single call to the PyTorch API.
I did not measure a difference in performance in the "overrides_benchmark", although I expect there to be a bit more work in the common case. Note that the noise factor for the benchmark is much larger than the differences reported below:
Before:
```
Type tensor had a minimum time of 2.3615360260009766 us and a standard deviation of 0.7833134150132537 us.
Type SubTensor had a minimum time of 10.473251342773438 us and a standard deviation of 0.1973132457351312 us.
Type WithTorchFunction had a minimum time of 5.484819412231445 us and a standard deviation of 0.13305981701705605 us.
Type SubWithTorchFunction had a minimum time of 11.098146438598633 us and a standard deviation of 0.15598918253090233 us.
```
After:
```
Type tensor had a minimum time of 2.2134780883789062 us and a standard deviation of 0.802064489107579 us.
Type SubTensor had a minimum time of 10.625839233398438 us and a standard deviation of 0.15155907021835446 us.
Type WithTorchFunction had a minimum time of 5.520820617675781 us and a standard deviation of 0.23115111980587244 us.
Type SubWithTorchFunction had a minimum time of 11.227846145629883 us and a standard deviation of 0.23032321769278497 us.
```
Fixes#106974
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106983
Approved by: https://github.com/zou3519, https://github.com/ezyang, https://github.com/albanD
Adds an API to mark a tensor as a static input.
To make this trigger recompiles properly, I'll need to update the tensor match checks to also check for this new attribute.
An additional concern is memory - the tensors will be kept alive, but this is the current behavior for nn modules and parameters.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107154
Approved by: https://github.com/eellison
## Summary
This is a re-land PR for https://github.com/pytorch/pytorch/pull/100706 to address the compilation latency performance regression.
## Root Cause
Regarding the C++/OpenMP backend, checking the vectorization ISA via `codecache.pick_vec_isa()` is a time-consuming, one-shot operation. It makes importing the `codegen.cpp` package take longer, because the package's `LoopLevel` is decorated with `@dataclasses.dataclass` and the decorator invokes `codecache.pick_vec_isa()` to initialize the `simd_nelements` field of `LoopLevel`.
c14cf312c9/torch/_inductor/codegen/cpp.py (L2883C53-L2883C53)
The Triton backend does not need to touch this, but we'd prefer to keep the code uniform. Therefore, the new design simultaneously registers `CppScheduling` for CPU and `TritonScheduling` for Triton, regardless of whether the current backend is Triton. This brings additional overhead to the Triton backend.
```python
def init_backend_registration(self):
    if get_scheduling_for_device("cpu") is None:
        from .codegen.cpp import CppScheduling
        register_backend_for_device("cpu", CppScheduling, WrapperCodeGen)
    if get_scheduling_for_device("cuda") is None:
        from .codegen.triton import TritonScheduling
        register_backend_for_device("cuda", TritonScheduling, WrapperCodeGen)
```
## Solution
To resolve the compilation latency regression for the Triton backend, we changed `LoopLevel` a little bit ([new code changes](https://github.com/pytorch/pytorch/pull/106874/files#diff-5ab7b0235e2076a5fc6629ba0b109208940f5b94f5c13babc3e0f87cf4fcec82R2893-R2904)) by moving the `simd_nelements` initialization to `__post_init__`, which restores the compilation performance.
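A minimal, self-contained sketch of the difference (generic names; `expensive_probe` stands in for `codecache.pick_vec_isa()`):
```python
import dataclasses

def expensive_probe() -> int:
    # stand-in for a costly one-shot check like codecache.pick_vec_isa()
    return 8

# Before: the default value is evaluated while the class body is processed,
# so merely importing the module pays the cost of the probe.
@dataclasses.dataclass
class LoopLevelBefore:
    simd_nelements: int = expensive_probe()

# After: the field is filled in lazily in __post_init__, so the probe only
# runs when an instance is actually created.
@dataclasses.dataclass
class LoopLevelAfter:
    simd_nelements: int = 0

    def __post_init__(self) -> None:
        if not self.simd_nelements:
            self.simd_nelements = expensive_probe()
```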
## Compilation Latency Performance Result
We ran a single model benchmark and reproduced the compilation regression:
- Run `python benchmarks/dynamo/torchbench.py -dcuda --training --performance --inductor --only hf_Bart`
- W/ PR #100706, the compilation latency is about **57~58** seconds
```
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cuda,hf_Bart,4,1.556712,109.676554,57.055242,0.936330,5.760698,6.152422,642,1,8,7
cuda,hf_Bart,4,1.646658,109.621747,57.909817,0.936330,5.760698,6.152422,642,1,8,7
```
- W/O PR #100706, the compilation latency is about **46~47** seconds
```
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cuda,hf_Bart,4,1.599065,108.702480,47.490346,0.936330,5.760698,6.152422,642,1,8,7
cuda,hf_Bart,4,1.588419,108.431411,46.983041,0.936330,5.760698,6.152422,642,1,8,7
```
This PR fixed the compilation performance regression.
- W/ this PR #106874, the compilation latency is about **47~48** seconds
```
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cuda,hf_Bart,4,1.586261,108.149467,47.481058,0.936330,5.760698,6.152422,642,1,8,7
cuda,hf_Bart,4,1.758915,108.613899,47.925633,0.936330,5.760698,6.152422,642,1,8,7
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106874
Approved by: https://github.com/jansel
test_gradient_extreme_cases_* takes ~5 minutes on the inductor sm86 shard, and possibly even longer on the inductor workflow since it's timing out there right now (although I'm not sure what the difference between the two is), and sometimes the automatic slow-test detection doesn't catch it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107189
Approved by: https://github.com/ZainRizvi
To avoid conflicting with potential existing workarounds or solutions outside of the exporter.
The latest huggingface/transformers main (>4.31) patches PyTorch PyTree with support for the `ModelOutput` class.
`_PyTreeExtensionContext` is kept to support prior versions of transformers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107245
Approved by: https://github.com/titaiwangms
ghstack dependencies: #106741, #107158, #107165
Adds a new structure to house all heuristics we use for Target Determination and Test Reordering. I'm keeping it somewhat minimal for now, to let it evolve more easily as we try new things.
It currently does nothing. The 2nd PR in the stack ports the existing heuristics to actually use this new framework.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106997
Approved by: https://github.com/clee2000, https://github.com/huydhn
This pattern shows up in torchrec KeyedJaggedTensor. Most
of the change in this PR is mechanical: whenever we failed
an unbacked symint test due to just error checking, we replace the
conditional with something that calls expect_true (e.g.,
torch._check or TORCH_SYM_CHECK).
Some of the changes are a bit more nuanced, I've commented on the PR
accordingly.
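A hedged sketch of the pattern (the function below is hypothetical; `torch._check` is the real API):
```python
import torch

def pad_to_length(x: torch.Tensor, length: int) -> torch.Tensor:
    # Instead of `if length < x.shape[0]: raise ...`, which forces a guard on a
    # possibly unbacked SymInt, assert the condition so tracing can assume it
    # holds and a runtime check is emitted.
    torch._check(length >= x.shape[0], lambda: "length must cover the input")
    return torch.nn.functional.pad(x, (0, length - x.shape[0]))

print(pad_to_length(torch.ones(3), 5))  # tensor([1., 1., 1., 0., 0.])
```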
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106788
Approved by: https://github.com/lezcano
ghstack dependencies: #106720
Here's what it does from the comments:
```
Assume that a boolean is true for the purposes of subsequent symbolic
reasoning. This will keep track of corresponding runtime checks to verify
that the result is upheld: either as a regular guard, or as a special set
of asserts which are triggered when an unbacked SymInt is allocated.
DO NOT use this function for these cases:
- This is inappropriate for "branching" conditions (where both
true and false result in valid programs). We will always assume
the condition evaluates true, and so it will never be possible
to trace the false condition when you use it. For true branching
on unbacked SymInts, you must use torch.cond.
- This is inappropriate for situations where you know some other system
invariant guarantees that this property holds, since you don't
really need to insert a runtime check in that case. Use something
like constrain_range in that case.
This API has a hitch. To avoid having to reimplement error reporting
capabilities, this function CAN return False. The invariant is that
the surrounding code must raise an error when this function returns
False. This is quite low level, so we recommend using other functions
like check() which enforce this in a more intuitive way.
By the way, this name is a nod to the __builtin_expect likely macro,
which is used similarly (but unlike __builtin_expect, you MUST fail
in the unlikely branch.)
```
We don't do anything with this right now, except use it to discharge regular guards. Follow-up PRs will (1) use it at important error checking sites, and (2) actually ensure the runtime asserts make their way into the exported IR / inductor generated code.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106720
Approved by: https://github.com/ysiraichi, https://github.com/voznesenskym
Summary:
This used to be not a problem because in c10d collective init, a store based barrier would be applied.
This recently got changed in https://github.com/pytorch/pytorch/pull/103033
where the barrier is not by default applied.
For normal PGs like gloo/nccl, this is not a problem as the rendezvous process is implicitly a barrier anyway.
But for the threaded pg, without the store based barrier this would lead to a race condition, as the local pg does not wait for the world to be ready before starting collectives.
This fixes the issue by just doing a store based barrier for each pg created.
The CV attempt wouldn't work since that would still rely on class level variables which would break in the device mesh case. See inline comment for details.
Differential Revision: D48220125
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106954
Approved by: https://github.com/wanchaol, https://github.com/H-Huang, https://github.com/XilunWu
Generate diagnostic reports to monitor the internal stages of the export process. This tool aids in unblocking model exports and debugging the exporter.
#### Settings
~~1. Choose if you want to produce a .sarif file and specify its location.~~
1. Updated: saving .sarif file should be done by `export_output.save_sarif_log(dst)`, similar to saving exported onnx model `export_output.save(model_dst)`.
2. Customize diagnostic options:
- Set the desired verbosity for diagnostics.
- Treat warnings as errors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106741
Approved by: https://github.com/titaiwangms, https://github.com/justinchuby, https://github.com/malfet
This allows infra/trainers to get detailed stats about communication
efficiencies without knowing anything about what model or distributed
training paradigms have been used. This is helpful as infra/trainer
package usually prefers to be as model/algorithm agnostic as possible.
Therefore, we cannot assume that infra/trainer can have access to all
collectives used by the model authors.
This commit adds an `OnCompletion` hook to `ProcessGroupNCCL` which
will be fired on every work completion event.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107233
Approved by: https://github.com/kumpera
```
In file included from /local/pytorch3/test/cpp/api/optim.cpp:7:
local/pytorch3/test/cpp/api/support.h:44:3: warning: '~WarningCapture' overrides a destructor but is not marked 'override' [-Winconsistent-missing-destructor-override]
~WarningCapture() {
^
local/pytorch3/c10/util/Exception.h:167:11: note: overridden virtual function is here
virtual ~WarningHandler() = default;
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107191
Approved by: https://github.com/janeyx99
This is a partial fix for https://github.com/pytorch/pytorch/issues/106457. In the examples with the shampoo optimizer that I ran, they were enough to remove the parameter aliasing in shampoo.
I added some new logic for detecting if two inputs have overlapping memory in specific cases: if they're both 2D tensors with stride 1. In that case (the case for shampoo), I try to compute a bunch of contiguous intervals on the two tensors, and check if any of the intervals overlap. In theory this is slow, since if our two tensors are e.g. of size (256, N), we'll need to create 256 intervals to check for overlap on. This seems... probably fine, since I think we do more egregious things in the compile stack to cause slowness. Open to suggestions though!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106461
Approved by: https://github.com/albanD
ghstack dependencies: #106460
This PR prefers "logical processor number" (the cpu cores shown in htop) returned by cpuinfo for determining c10 thread number. If that fails, it uses hardware_concurrency exactly.
The motivation is that on an x86 host with 64 cores and Hyper-Threading disabled, the current behavior uses 32 threads, resulting in half of the cores being idle.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107010
Approved by: https://github.com/ezyang
This is a follow up to https://github.com/pytorch/pytorch/pull/105881 and replaces https://github.com/pytorch/pytorch/pull/103203
The batched linalg drivers from #103203 were brought in as part of the first PR. This change enables the ROCm unit tests unlocked by that change, along with a fix to prioritize hipsolver over magma when the preferred linalg backend is set to `default`.
The following 16 unit tests will be enabled for rocm in this change:
- test_inverse_many_batches_cuda*
- test_inverse_errors_large_cuda*
- test_linalg_solve_triangular_large_cuda*
- test_lu_solve_batched_many_batches_cuda*
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106620
Approved by: https://github.com/lezcano
Summary: this is needed for int4 weight-only quantization, we're
matching on the specific unpack operation that unpacks the uint4x2 into
int4's so we can have a fused kernel for it. note, even if the user
isn't specifically doing this, the two operations are mathematically
equivalent so it won't cause issues (for some reason int8 bitwise logic
in triton and pytorch doesn't match so that's the only exception). Ideally
at some point full prologue fusion for the mm arguments would be able to
handle this chain but until then, this type of kernel is needed.
Test Plan:
python test/inductor/test_pattern_matcher.py -k "uint4x2"
python test/inductor/test_torchinductor.py -k "uint4x2"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106516
Approved by: https://github.com/jansel
Adds LRU functionality to the cuDNN frontend `ExecutionPlan` cache to address high memory usage as observed in #98688, #104122 via the `TORCH_CUDNN_V8_LRU_CACHE_LIMIT` environment variable. By default this limit is set to 10000, which corresponds to about 2GiB of host memory usage as observed empirically. Note that we are still following up with cuDNN to see if the size of an `ExecutionPlan` can be reduced, as it appears to currently be around 200KiB (!!) for a single plan.
This implementation is a bit heavy on the internal asserts for now as it's a bit difficult to directly test the state of the cache without instrumenting it explicitly in tests. Once we are confident that the implementation is stable, we can remove the asserts.
CC @malfet who @ptrblck mentioned may have also been looking into this
CC @colesbury
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104369
Approved by: https://github.com/malfet
fmt 10.1.0 fixes a bug in the initialisation order of format_string_checker, which is important to our improved clang-tidy checks #103058. This PR upgrades third_party fmt to 10.1.0; in the meanwhile, kineto is also upgraded to avoid fmt errors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106672
Approved by: https://github.com/Skylion007
Some notable changes:
1. `constrain_as_size` allows the min value to be less than 2, as it will unconditionally assume min >= 2 for compiler purposes. Instead, we add an additional check to make sure the max value is always greater than 2.
2. Previously, we used to runtime-assert on the unbacked symint's value range, which would always be between [2, max]. I modified this logic to assert on [0, max] unless the user explicitly specifies the min range.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106591
Approved by: https://github.com/gmagogsfm, https://github.com/ezyang
https://github.com/pytorch/pytorch/pull/106524 got merged so fast that we didn't realize we should hash both stride and dtype in DTensorSpec. This is a forward fix.
One analysis of why using just the shape is not enough:
1. We use the hash value for the sharding propagation cache. And the output sharding contains the stride and size of the output DTensor. If we don't consider stride, we will see errors.
2. One reason can be found below:
```
OpSchema(func_schema=aten::t(Tensor(a) self) -> Tensor(a), args_schema=(DTensorSpec(mesh=DeviceMesh:([0, 1, 2, 3, 4, 5, 6, 7]), placements=(Shard(dim=0),), tensor_meta=TensorMetadata(shape=torch.Size([64, 128]), dtype=torch.float32, requires_grad=False, stride=(128, 1), memory_format=None, is_quantized=False, qparams={})),), kwargs_schema={})
```
```
OpSchema(func_schema=aten::t(Tensor(a) self) -> Tensor(a), args_schema=(DTensorSpec(mesh=DeviceMesh:([0, 1, 2, 3, 4, 5, 6, 7]), placements=(Shard(dim=0),), tensor_meta=TensorMetadata(shape=torch.Size([64, 128]), dtype=torch.float32, requires_grad=False, stride=(1, 64), memory_format=None, is_quantized=False, qparams={})),), kwargs_schema={})
```
The only difference between the two OpSchemas is the tensor stride:
<img width="151" alt="image" src="https://github.com/pytorch/pytorch/assets/6937752/161335df-bdfb-47c5-ba79-82616d070d15">
that makes the transpose op generates wrong result and leads to the add_/addmm_ op failing with errors:
```
Traceback (most recent call last):
  File "/data/users/fduwjj/pytorch/torch/multiprocessing/spawn.py", line 74, in _wrap
    fn(i, *args)
  File "/data/users/fduwjj/pytorch/benchmarks/distributed/tensor/tp_benchmark.py", line 210, in run_tp
    output.sum().backward()
  File "/data/users/fduwjj/pytorch/torch/_tensor.py", line 491, in backward
    torch.autograd.backward(
  File "/data/users/fduwjj/pytorch/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
  File "/data/users/fduwjj/pytorch/torch/distributed/_tensor/api.py", line 252, in __torch_dispatch__
    return op_dispatch.operator_dispatch(
  File "/data/users/fduwjj/pytorch/torch/distributed/_tensor/dispatch.py", line 116, in operator_dispatch
    out, _, _ = _operator_dispatch(op_call, args, kwargs, sharding_propagator)
  File "/data/users/fduwjj/pytorch/torch/distributed/_tensor/dispatch.py", line 246, in _operator_dispatch
    local_results = op_call(*local_tensor_args, **local_tensor_kwargs)
  File "/data/users/fduwjj/pytorch/torch/_ops.py", line 435, in __call__
    return self._op(*args, **kwargs or {})
RuntimeError: The size of tensor a (64) must match the size of tensor b (8) at non-singleton dimension 1
```
The same thing happens with dtype: if we are using DTensor in a mixed-precision environment, we will run into situations like this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107181
Approved by: https://github.com/wanchaol
ghstack dependencies: #106524
This PR implements the feature described in #107036 for `no_grad`, `enable_grad` and `inference_mode`.
Users can still use the above as before but they can also use them without parentheses.
For example:
```python
import torch
a = torch.ones(1, requires_grad=True)
def do_something():
    print(2 * a)

with torch.no_grad():
    do_something()  # tensor([2.])

torch.no_grad()(do_something)()  # tensor([2.])
torch.no_grad(do_something)()  # tensor([2.])
do_something()  # tensor([2.], grad_fn=<MulBackward0>)
```
For `inference_mode`, decorating without parentheses is equivalent to decorating with the default `mode=True`, similar to how dataclasses behave (https://docs.python.org/3/library/dataclasses.html#module-contents)
Closes #107036
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107086
Approved by: https://github.com/albanD
This allows infra/trainers to get detailed stats about communication
efficiencies without knowing anything about what model or distributed
training paradigms have been used. This is helpful as infra/trainer
package usually prefers to be as model/algorithm agnostic as possible.
Therefore, we cannot assume that infra/trainer can have access to all
collectives used by the model authors.
This commit adds an `OnCompletion` hook to `ProcessGroupNCCL` which
will be fired on every work completion event.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106988
Approved by: https://github.com/kumpera, https://github.com/H-Huang
ghstack dependencies: #107140, #107141, #107160
…out specifying the Backend
When init_process_group has not been called beforehand, DeviceMesh will automatically call init_process_group without specifying the backend. Thus, when a third-party device backend wants to use DeviceMesh without calling init_process_group first, a problem arises. In this PR, we add a default_device_backend_map so that third-party device users can add their backends to this map when they first register their backends with PyTorch. When init_process_group is called without the backend parameter, it will initialize the backends in this map. Thus, a third-party user can call init_process_group without specifying the Backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107113
Approved by: https://github.com/wanchaol
Bionic support ended back in April 2023, see https://ubuntu.com/blog/ubuntu-18-04-eol-for-devices
And neither gcc-7 nor clang7 is fully compatible with c++17, so update the minimal tested gcc to gcc9 and clang to clang-10
Note: OpenMP support is broken in Focal's `clang9`, so move up to `clang10`
- Suppress `-Wuninitialized` in complex_test as gcc-11 fires a seemingly false-positive warning:
```
In file included from /home/malfet/git/pytorch/pytorch/c10/test/util/complex_test.cpp:1:
/home/malfet/git/pytorch/pytorch/c10/test/util/complex_test_common.h: In member function ‘virtual void memory::TestMemory_ReinterpretCast_Test::TestBody()’:
/home/malfet/git/pytorch/pytorch/c10/test/util/complex_test_common.h:38:25: warning: ‘z’ is used uninitialized [-Wuninitialized]
38 | c10::complex<float> zz = *reinterpret_cast<c10::complex<float>*>(&z);
| ^~
/home/malfet/git/pytorch/pytorch/c10/test/util/complex_test_common.h:37:25: note: ‘z’ declared here
37 | std::complex<float> z(1, 2);
| ^
```
- Downgrade `ucc` to 2.15, as 2.16 brings an incompatible libnccl that causes a crash during initialization
- Install `pango` from the conda environment for the `doctr` torch bench tests to pass, as the one available in the system is too new for conda
- Suppress some functorch tests when used with python-3.8+dynamo, see https://github.com/pytorch/pytorch/issues/107173
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105260
Approved by: https://github.com/huydhn, https://github.com/Skylion007, https://github.com/ZainRizvi, https://github.com/seemethere
According to https://docs.nvidia.com/cuda/cublas/#id99 only FP8 matrix types can be scaled, and `Float8_e4m3`x`Float8_e4m3` results can be returned as `Float8_e4m3` type, or upcast to `Half`, `BFloat16` or `Float`, but in that case `result_scale` has no effect and `amax` is not computed.
An optional `bias` argument can also be passed to the function; it should be a vector of either `Half` or `BFloat16`, whose values are added to each row of the result matrix.
See table below for supported input and output types:
| Mat1 type | Mat2 type | Bias type | Output types |
| ----------- | ----------- | ----------- | ----------- |
| Float8_e4m3 | Float8_e4m3 | Float16 | Float8_e4m3, Float16 |
| Float8_e4m3 | Float8_e4m3 | BFloat16 | Float8_e4m3, BFloat16, Float |
| Float8_e5m2 | Float8_e4m3 | Float16 | Float8_e4m3, Float8_e5m2, Float16 |
| Float8_e5m2 | Float8_e4m3 | BFloat16 | Float8_e4m3, Float8_e5m2, BFloat16, Float |
| Float8_e4m3 | Float8_e5m2 | Float16 | Float8_e4m3, Float8_e5m2, Float16 |
| Float8_e4m3 | Float8_e5m2 | BFloat16 | Float8_e4m3, Float8_e5m2, BFloat16, Float |
| Float8_e4m3 | Float8_e5m2 | Not supported | Not supported |
Skip the decomposition implementation until the fp8-on-triton story is better defined. A potential decomposition could look something like the following:
```python
@register_decomposition(aten._scaled_mm)
def _scaled_mm(
    mat1: Tensor,
    mat2: Tensor,
    *,
    dtype: Optional[torch.dtype] = None,
    scale_a: Optional[Tensor] = None,
    scale_b: Optional[Tensor] = None,
    scale_result: Optional[Tensor] = None,
) -> Tuple[Tensor, Tensor]:
    rc = torch.mm(mat1.to(torch.float32), mat2.to(torch.float32))
    rc = scale_a * rc if scale_a is not None else rc
    rc = scale_b * rc if scale_b is not None else rc
    rc = scale_result * rc if scale_result is not None else rc
    rc = rc.to(dtype if dtype is not None else mat1.dtype)
    return rc, torch.tensor(0.0, device=mat1.device)
```
Known limitations:
- Only works for matrix sizes divisible by 16
- 1st operand must be in row-major and 2nd in column-major order (i.e. if `x` and `y` are contiguous, then only `torch._scaled_mm(x, y.t())` will work)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106844
Approved by: https://github.com/albanD
ghstack dependencies: #106977
This PR adds `generate_opcheck_tests`. This is a utility that adds
additional crossref tests to an existing TestCase that has tests that
invokes operators. The main use case is if you have a large test suite
that already exercises operators and want to add automated testing that
the operators are correct, without actually refactoring your code into
something like OpInfos.
Given a `test_` method of a TestCase, we will generate one new
additional test for each of {schema correctness, autograd registration,
faketensor rule, aot_autograd static shapes, aot_autograd dynamic
shapes}. Each newly generated test runs the original test method under a
special torch_function mode (OpCheckMode) that intercepts
`op(*args, **kwargs)` calls and additionally passes (op, args, kwargs) to
a separate function (e.g. SchemaCheck).
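A heavily simplified, hedged sketch of that mechanism (not the real OpCheckMode implementation):
```python
import torch
from torch.overrides import TorchFunctionMode

class OpCheckSketch(TorchFunctionMode):
    """Intercept every op call, hand (op, args, kwargs) to a checker, then run the op."""

    def __init__(self, checker):
        super().__init__()
        self.checker = checker

    def __torch_function__(self, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        self.checker(func, args, kwargs)  # e.g. a schema or fake-tensor check
        return func(*args, **kwargs)

# Rerun an existing test body under the mode:
with OpCheckSketch(lambda op, a, k: print("checked:", op.__name__)):
    torch.cumsum(torch.arange(4.0), dim=0)
```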
Nitty-gritty details:
- If a test is named test_cumsum, we end up generating new tests
(`test_schema__test_cumsum`, `test_<something>__test_cumsum`)
- Users can provide a dictionary of expected failures / skips that is indexed on
operators. This gives us a sense of what operators support PT2 and which
operators require fixing before they support PT2.
Due to some co-dev limitations, I'm planning on landing this PR first
and then using it to add crossref testing for internal tests and
fbgemms. I could squash this PR with the internal changes if we want to
see how that works out, just let me know.
Test Plan:
- We create a mini op test suite called MiniOpTests.
- Then, we use `generate_opcheck_tests` to generate tests onto it.
- We have our own test xfail list to check that the things that should
fail do fail.
- Finally, there is a separate TestGenerateOpcheckTests that checks that
the correct number of tests were generated and also tests some helper
functions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106903
Approved by: https://github.com/ezyang, https://github.com/bdhirsh
Summary:
When loading a CPU state_dict with a pg initialized with
cpu:gloo,cuda:nccl, we hit a gloo crash since the dest tensor is on GPU and the input
is on CPU.
As a workaround, just enforce that if local_tensor.is_cpu, the dest tensor is
also on CPU.
Test Plan: CI
Differential Revision: D48324752
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107172
Approved by: https://github.com/fegin
The previous implementation only works on CPU, and it does not respect
the fact that each rank has its data on a different device (i.e. cuda),
so the implementation raises an error like the one below:
```
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!
```
See report in https://github.com/pytorch/pytorch/pull/105604#issuecomment-1675472670
This PR fixes this issue; we tested that the previously failing GPU tests now pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107151
Approved by: https://github.com/kumpera
Previously when we recorded a free action in a memory trace, we would provide
the stack for when the block was allocated. This is faster because we do not
have to record stacks for free, which would otherwise double the number of stacks
collected. However, sometimes knowing the location of a free is useful for
figuring out why a tensor was live. So this PR adds this behavior. If
performance ends up being a concern the old behavior is possible by passing
"alloc" to the context argument rather than "all".
Also refactors some of glue logic to be consistent across C++ and Python and
routes the Python API through the C++ version.
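A hedged usage sketch (argument names reflect the torch.cuda.memory API as I understand it; treat them as assumptions):
```python
import torch

# context="all" records stacks for both allocations and frees;
# context="alloc" keeps the old allocation-only behavior.
torch.cuda.memory._record_memory_history(context="all")

x = torch.randn(1024, 1024, device="cuda")
del x

snapshot = torch.cuda.memory._snapshot()  # trace now includes free events with stacks
```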
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106758
Approved by: https://github.com/albanD
This is part of effort to enable missed cpp tests for ROCm platform.
In this change,
- enabled test_libtorch cpp tests (more than 3107 tests)
- fixed a missing dependency: libcaffe2_nvrtc.so, required by FunctionalTest.Conv1d
- the test_api binary is changed to exclude the failing InitTest and IntegrationTest suites - to revisit later
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106712
Approved by: https://github.com/jithunnair-amd, https://github.com/kit1980
TIL, uploading to Rockset has an upper limit of 5000 records per request. So uploading PT2 perf benchmark could fail if that limit was reached, for example https://github.com/pytorch/pytorch/actions/runs/5828810421/job/15849232756
```
HTTP response body: {"message":"The number of documents specified in this request exceeds the maximum allowed limit of 5,000 documents.","message_key":"RECEIVER_REQUEST_MAX_DOCUMENT_LIMIT","type":"INVALIDINPUT","line":null,"column":null,"trace_id":"73fc2eb5-cfd1-4baa-8141-47c7cde87812","error_id":null,"query_id":null,"internal_errors":null}
```
The fix is to upload the results in multiple smaller batches of at most 5000 records.
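A minimal sketch of the batching logic (the upload callable is assumed, not the actual script's function):
```python
BATCH_SIZE = 5000  # Rockset's per-request document limit

def upload_in_batches(documents, upload_batch):
    for i in range(0, len(documents), BATCH_SIZE):
        batch = documents[i : i + BATCH_SIZE]
        print(f"Writing {len(batch)} documents to Rockset")
        upload_batch(batch)
        print("Done!")
```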
### Testing
5743 records from https://github.com/pytorch/pytorch/actions/runs/5828810421/job/15849232756 were written in 2 batches (5000 + 743)
```
python3 -m tools.stats.upload_dynamo_perf_stats --workflow-run-id 5821183777 --workflow-run-attempt 1 --repo pytorch/pytorch --head-branch gh/ezyang/2294/head
...
Writing 5000 documents to Rockset
Done!
Writing 743 documents to Rockset
Done!
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107095
Approved by: https://github.com/atalman, https://github.com/seemethere, https://github.com/ZainRizvi
Update graph_signature according to graph after transformation.
Transformations can lead to node name changes, which are used in
graph_signature to identify inputs and outputs. Therefore, after each
transformation, we need to update the graph_signature according to
new node names.
WARNING: This implementation makes a few assumptions
- The transformation doesn't change number of inputs/outputs
- Each input/output still has the same meaning.
- For inputs, that means that the inputs in transformed
graph map to the same lifted parameter/buffer or user
input as the input of the same position in the graph
before transformation.
- Similarly for outputs, each output should correspond to the
same mutated buffer or user output as the output value of
the same position in the graph before transformation.
It is difficult to programmatically validate these assumptions, but they
should hold true most of the time as inputs/outputs of the graph rarely
need to be changed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107080
Approved by: https://github.com/tugsbayasgalan
PR #90689 replaces NVTX with NVTX3. However, the torch::nvtoolsext target is created only when the third-party NVTX is used.
This is clearly a logical error. We now move the creation code out of the branch to cover all cases. This should fix the issues reported in the comments of #90689.
It would be better to move the configurations of the failed FRL jobs to CI tests so that we can find such issues early, before merging.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97582
Approved by: https://github.com/peterbell10
- impl_save_for_backward/impl_backward only work for functional,
non-view schemas. We validate this.
- impl_save_for_backward/impl_backward raise if there already exists an
autograd implementation from torch.library / TORCH_LIBRARY.
- Operators constructed via custom_op receive an "autograd indirection
kernel". The "autograd indirection kernel" automatically pulls the
constructed autograd kernel out of a dict. When
impl_save_for_backward/impl_backward get used with torch.library
operators, we also register the "autograd indirection kernel" so we can
reuse the logic.
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106817
Approved by: https://github.com/soulitzer
ghstack dependencies: #106799, #106800
Recall that the user must give us a backward function that accepts
`(ctx, saved, *grads)`, with one grad per output. Previously,
impl_backward only worked for functions that return one or more Tensors.
The new semantics are that if the output has:
- a TensorList, the backward function provided by the user will receive
a List[Tensor] of grads for that output.
- a number, the backward function provided by the user will receive
None as the grad.
Also recall that impl_backward is implemented by registering an
autograd.Function to the autograd dispatch key.
We needed to make the following changes:
- If an output is a TensorList, autograd.Function will ignore it. So we
need to tree-flatten it before returning it from the autograd.Function
- This means that the autograd.Function receives a flat list of grad
during the backwards pass. We need to tree-unflatten it into the correct
shape before passing it to the user-defined backward
- We modify the logic of output_differentiability. Only
Tensor/TensorList outputs can be marked as differentiable. If a
TensorList is marked as non-differentiable, then this is equivalent to
all Tensors in the list being non-differentiable. There is no
finer-grain control over this (to match derivatives.yaml).
Test Plan:
- There are new `numpy_split_copy` (returns TensorList) and
`numpy_split_copy_with_int` (returns (TensorList, int)) operators in
custom_op_db
- Added tests for output_differentiability into test/test_custom_ops.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106800
Approved by: https://github.com/soulitzer
ghstack dependencies: #106799
This expands the torch._custom_ops.custom_op API to be able to construct
operators that return (int, bool, float, Scalar, List[Tensor]) to make
it more in-line with our torch.library API.
NB: there needs to be updates to our custom_op autograd registration
API. For ease of review those changes will go in the next PR up but I
can squash if requested.
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106799
Approved by: https://github.com/soulitzer
This PR wraps the `InstructionTranslator` run with a try-catch block so as to run
translation validation (TV) if it ends up raising an error.
In this context, we run TV so as to catch simplification errors. These may render
`ShapeEnv.divisible` and `ShapeEnv.replacements` incorrect.
For example: #101173 describes a SymPy simplification bug that doesn't reach TV, since
TV is run only at the end of tracing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106645
Approved by: https://github.com/ezyang
Summary: Basically we generate `CustomOpsNativeFunctions.h` for registering custom ops into the PyTorch JIT runtime. This header needs to hook up with the C++ kernel implementations of all the custom ops. For this reason it should include ATen headers instead of Executorch headers. This PR changes that.
Test Plan: Rely on existing CI jobs
Differential Revision: D48282828
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107064
Approved by: https://github.com/kirklandsign
Fixes #107066, closes #107008
This replaces loads to zero-element `Loops` or `Buffer`s with `ops.constant`
calls. This both avoids the issue of masked loads under triton, and also means
the buffer is not listed as a dependency for downstream users which may improve
performance generally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107074
Approved by: https://github.com/davidberard98
If I understand the code correctly, we want to add a fusion choice if
- node2 is a template or foreach node
and
- can_fuse returns true for (node2, node1)
But the code misses a pair of parentheses, since in Python 'and' has higher precedence than 'or'. This does not cause much damage, since even if we add a pair of nodes that cannot be fused, we will skip them later when we call can_fuse again (in fuse_nodes_once). Fixing this mainly to avoid confusion.
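A standalone illustration of the precedence issue (made-up flags; only the boolean structure matters):
```python
is_template, is_foreach, can_fuse_reversed = True, False, False

# Buggy: parsed as `is_template or (is_foreach and can_fuse_reversed)`
buggy = is_template or is_foreach and can_fuse_reversed
# Intended: `(is_template or is_foreach) and can_fuse_reversed`
fixed = (is_template or is_foreach) and can_fuse_reversed

print(buggy, fixed)  # True False - the buggy form adds a pair that cannot fuse
```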
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107001
Approved by: https://github.com/jansel, https://github.com/mlazos
Move the remaining collectives to a separate file to prepare device mesh
to become a public distributed API
For those remaining utils, we need to upstream them to functional
collectives with a proper implementation; added a TODO there for a follow-up
PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107012
Approved by: https://github.com/fduwjj
Summary
- The 'dynamo_export' diagnostics leverages the PT2 artifact logger to handle the verbosity
level of logs that are recorded in each SARIF log diagnostic. In addition to SARIF log,
terminal logging is by default disabled. Terminal logging can be activated by setting
the environment variable `TORCH_LOGS="onnx_diagnostics"`. When the environment variable
is set, it also fixes logging level to `logging.DEBUG`, overriding the verbosity level
specified in the diagnostic options.
See `torch/_logging/__init__.py` for more on PT2 logging.
- Replaces 'with_additional_message' with 'Logger.log' like apis.
- Introduce 'LazyString', adopted from 'torch._dynamo.utils', to skip
evaluation if the message will not be logged into diagnostic.
- Introduce 'log_source_exception' for easier exception logging.
- Introduce 'log_section' for easier markdown title logging.
- Updated all existing code to use new api.
- Removed 'arg_format_too_verbose' diagnostic.
- Rename legacy diagnostic classes for TorchScript Onnx Exporter to avoid
confusion.
Follow ups
- The 'dynamo_export' diagnostic now will not capture python stack
information at point of diagnostic creation. This will be added back in
follow up PRs for debug level logging.
- There is type mismatch due to subclassing 'Diagnostic' and 'DiagnosticContext'
for 'dynamo_export' to incorporate with PT2 logging. Follow up PR will
attempt to fix it.
- More docstrings with examples.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106592
Approved by: https://github.com/titaiwangms
Recently I feel it's a bit painful to run benchmark scripts on my dev environment. E.g., the command below
```
python benchmarks/dynamo/huggingface.py --backend inductor --amp --performance --only YituTechConvBert --training
```
took about 2 minutes to run. It may take even longer for some other models.
The command is slow since it
- need do dynamo work
- verify the model on CPU
- run perf tests
- compile all the graphs
However, oftentimes I only need to debug inductor-specific logic like loop ordering and fusion. A lot of the things the script does are useless for me. Also, I only need to test one graph at a time (e.g. check the fwd graph first and, when I'm done, continue to check the bwd graph) rather than compiling all the graphs.
The graph replayer adds a `@save_args` decorator to the compile_fx_inner function. When `config.save_args` is true, it will pickle all the arguments to `compile_fx_inner` to the file system. Later on, we can call `load_args_and_run_compile_fx_inner("/tmp/inductor_saved_args/compile_fx_inner_0.pkl")` to replay the graph and compile it with inductor.
Replaying the fwd graph took around 60 seconds (maybe this can be further reduced, but this is already a 2x speedup for dev efficiency), and it only took around 20 seconds to reach the `Scheduler.__init__` method.
I also checked the `TORCH_COMPILE_DEBUG` flag that already exists. The most similar part of `TORCH_COMPILE_DEBUG` is that it can save a graph and its arguments and rerun it later on. But the difference here is that, rather than running the model, we want to call the inductor API to compile the model (without even going thru dynamo or aot-autograd).
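A hedged sketch of the idea behind the decorator (not the actual inductor code; the path and counter are illustrative):
```python
import functools
import itertools
import os
import pickle

_counter = itertools.count()

def save_args(fn):
    """Pickle each call's arguments so the compile step can be replayed later."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        path = f"/tmp/inductor_saved_args/{fn.__name__}_{next(_counter)}.pkl"
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            pickle.dump((args, kwargs), f)
        return fn(*args, **kwargs)
    return wrapper
```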
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106952
Approved by: https://github.com/jansel
ghstack dependencies: #106990
This fixes a bug that could occur with python decompositions.
When an operation is intercepted in the c++ code in pytorch, the outputs are created as `ExclusivelyOwned<at::Tensor>`s. Later on, when it dispatches back to python for the decomposition, these tensors have their ownership shared with python. In a normal use case the exclusively owned tensor is released and its value is returned as a non-exclusively owned tensor from the operation. However, if the python decomposition throws an error, the `ExclusivelyOwned` wrapper destroys the `at::Tensor`, leading to a python reference to a tensor which isn't alive (and meaning pytorch falls over in debug mode).
Note this will be a performance hit when handling errors.
Fixes #106790
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106791
Approved by: https://github.com/ezyang
AsyncCollectiveTensor is a tensor subclass that is meant to "delay synchronization" when you call into the functional collectives API's. It does this (if I understand correctly) by internally holding an "unsynchronized" version of the tensor, which is the result of the communication op, and internally calling `.wait()` to synchronize the data the next time it is used.
Previously, these wait() calls would happen immediately, because `AsyncCollectiveTensor` gets wrapped by `DTensor()`, which calls `.detach()` on its inner tensor, immediately causing the sync (code: 1518d5eec4/torch/distributed/_tensor/api.py (L207))
AsyncCollectiveTensor shouldn't need to do a synchronization if you try to detach() it though - in fact, it should be fine to avoid synchronizing if you perform any view ops on it (which just require viewing metadata, but not actual data). This PR tries to update `AsyncCollectiveTensor` to delay `wait()` calls whenever the subclass encounters a view op.
Added some light testing, that just runs some DTensor compute followed by view ops, and confirms that the output is still an `AsyncCollectiveTensor` when we call `.to_local()`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105240
Approved by: https://github.com/wanchaol, https://github.com/fduwjj, https://github.com/wconstab
Summary:
MM max autotune (and friends) crash when one of the inputs is zero-size.
E.g., running this code:
```
@torch.compile()
def fn(x, y):
    return torch.mm(x, y)

inps = [torch.rand([0, 30]), torch.rand([30, 40])]
inps = [x.to(device="cuda") for x in inps]
out = fn(*inps)
```
with this command:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 python test.py
```
raises this error (the top of the stack trace omitted for brevity):
```
...
File "/data/users/aakhundov/pytorch/torch/_inductor/kernel/mm.py", line 119, in tuned_mm
return autotune_select_algorithm("mm", choices, [mat1, mat2], layout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/aakhundov/pytorch/torch/_inductor/select_algorithm.py", line 960, in autotune_select_algorithm
return _ALGORITHM_SELECTOR_CACHE(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/aakhundov/pytorch/torch/_inductor/select_algorithm.py", line 787, in __call__
timings = self.lookup(
^^^^^^^^^^^^
File "/data/users/aakhundov/pytorch/torch/_inductor/codecache.py", line 267, in lookup
timings[choice] = benchmark(choice)
^^^^^^^^^^^^^^^^^
File "/data/users/aakhundov/pytorch/torch/_inductor/select_algorithm.py", line 774, in autotune
raise ErrorFromChoice(msg, choice, benchmark_fn.debug_str())
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
LoweringException: ErrorFromChoice: Please run `ptxas /tmp/compile-ptx-src-bfb1c6` to confirm that this is a bug in `ptxas`
From choice TritonTemplateCaller(/tmp/torchinductor_aakhundov/z7/cz7n7nn6rdlaelu4pbaaurgmu74ikl6g76lkngwawrevlfxlc6re.py, ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=16, BLOCK_N=64, EVEN_K=False, GROUP_M=8, num_stages=2, num_warps=4)
inputs = [
torch.empty_strided((0, 30), (30, 1), dtype=torch.float32, device='cuda'),
torch.empty_strided((30, 40), (40, 1), dtype=torch.float32, device='cuda'),
]
out = torch.empty_strided((0, 40), (40, 1), dtype=torch.float32, device='cuda')
target: aten.mm.default
args[0]: TensorBox(StorageBox(
InputBuffer(name='arg1_1', layout=FixedLayout('cuda', torch.float32, size=[0, s0], stride=[s0, 1]))
))
args[1]: TensorBox(StorageBox(
InputBuffer(name='arg3_1', layout=FixedLayout('cuda', torch.float32, size=[s0, s1], stride=[s1, 1]))
))
```
This PR adds a check to skip Triton templates in the `mm`, `addmm`, `mm_plus_mm` autotuning when the product of the MM problem shape (`m * n * k`) is zero.
Additionally, early exit conditions have been added to the mm and mm_plus_mm Triton templates on `M * N * K == 0`, to prevent issues when autotuning was done on non-zero-size inputs with dynamic shapes, then zero-size inputs are encountered by the compiled model.
Test Plan:
```
$ python test/inductor/test_max_autotune.py -v
...
----------------------------------------------------------------------
Ran 16 tests in 29.569s
OK
```
Reviewers: @eellison
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106865
Approved by: https://github.com/jansel
Although the sun is setting for torchscript, it is not [officially deprecated](https://github.com/pytorch/pytorch/issues/103841#issuecomment-1605017153) since nothing currently fully replaces it. Thus, "downstream" libraries like TorchVision, that started offering torchscript support still need to support it for BC.
torchscript has forced us to use workaround after workaround since forever. Although this makes the code harder to read and maintain, we made our peace with it. However, we are currently looking into more elaborate API designs that are severely hampered by our torchscript BC guarantees.
Although likely not intended as such, while looking for ways to enable our design while keeping a subset of it scriptable, we found the undocumented `__prepare_scriptable__` escape hatch:
0cf918947d/torch/jit/_script.py (L977)
One can define this method, and when `torch.jit.script` is called on the object, the object returned by the method is scripted rather than the original object. In TorchVision we are using exactly [this mechanism to enable BC](3966f9558b/torchvision/transforms/v2/_transform.py (L122-L136)) while allowing the object in eager mode to be a lot more flexible (`*args, **kwargs`, dynamic dispatch, ...).
Unfortunately, this escape hatch is only available for `nn.Module`'s
0cf918947d/torch/jit/_script.py (L1279-L1283)
This was fine for the example above since we were subclassing from `nn.Module` anyway. However, we recently also hit a case [where this wasn't the case](https://github.com/pytorch/vision/pull/7747#issuecomment-1642045479).
Given the frozen state on JIT, would it be possible to give us a general escape hatch so that we can move forward with the design unconstrained while still keeping BC?
This PR implements just this by re-using the `__prepare_scriptable__` hook.
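As a rough sketch of what this enables, a non-`nn.Module` object could hand `torch.jit.script` a scriptable stand-in (the class names below are illustrative, not from the PR):
```python
import torch

class ScriptableAdd(torch.nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + 1

class FlexibleAdd:
    """Eager-mode object that is too dynamic to script directly."""

    def __call__(self, x, *args, **kwargs):
        return x + 1

    def __prepare_scriptable__(self):
        # torch.jit.script scripts this stand-in instead of the original object.
        return ScriptableAdd()

scripted = torch.jit.script(FlexibleAdd())
print(scripted(torch.ones(3)))  # tensor([2., 2., 2.])
```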
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106229
Approved by: https://github.com/lezcano, https://github.com/ezyang
I found that for a tiled kernel for tensor with shape [a, b], we map 'a' with XBLOCK and 'b' with YBLOCK. However, 'a' actually should be the outer looper while 'b' corresponding to the inner loop. This order is picked by our loop ordering algorithm. Mapping 'a' with XBLOCK has the semantic like assigning 'a' to the inner loop instead.
For a simple 'A + B.t()' kernel, making the loop order consistent brings a 1.027x speedup (1.938ms -> 1.887ms). Here is a dump of the kernels:
- before fix: https://gist.github.com/shunting314/4dacf73cf495cdd7e84dede7c3e0872d
- after fix (this one is done manually): https://gist.github.com/shunting314/441e8839d24e1878c313e539b1ebd551
I tried this on DistillGPT2 and found perf is neutral, but that is because DistillGPT2 has a single tiled pointwise kernel in its backward graph. Will check the dashboard.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106827
Approved by: https://github.com/jansel
Port fix from https://github.com/huggingface/safetensors/pull/318 into ONNX exporter until it is merged
* This adds support for loading safetensors within a FakeTensorMode, which results in creating `torch.empty((shape,), dtype=)`. This is done through a monkeypatch for the in-progress https://github.com/huggingface/safetensors/pull/318
* Adds a test for the HF bloom model (bigscience/bloom-560m)
* This PR also fixes existing fake tensor unit tests by moving the `torch.onnx.dynamo_export` call to be inside the `enable_fake_mode()` context. Although calling `torch.onnx.dynamo_export` outside the context works for several models, the right way of using fake mode is to call the exporter within the context manager.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106930
Approved by: https://github.com/BowenBao
Summary: if torch._inductor.config.use_mixed_mm is set, we can convert
torch.mm(a, b.to(some_dtype)) into a Triton kernel where the cast of b
is fused into the matmul rather than needing to instantiate the casted b
tensor. If use_mixed_mm is set, this fused kernel is autotuned
against the default two-kernel fallback option. If force_mixed_mm is set, the
fused kernel is always used. This option is needed for weight-only quantization, where in
some cases we rely on the superior memory characteristics of the fused
kernel rather than on the perf numbers (when we can't afford to fill memory
with a tensor 4x the size of our quantized one).
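A rough sketch of the targeted pattern (shapes, dtypes, and the CUDA device below are illustrative):
```python
import torch
import torch._inductor.config as inductor_config

inductor_config.use_mixed_mm = True  # autotune the fused cast+mm against the fallback

@torch.compile()
def mixed_mm(a, b_int8):
    # The cast of b is fused into the matmul kernel instead of materializing
    # a full-precision copy of b (important for weight-only quantization).
    return torch.mm(a, b_int8.to(a.dtype))

a = torch.randn(64, 128, device="cuda", dtype=torch.float16)
b = torch.randint(-128, 127, (128, 32), device="cuda", dtype=torch.int8)
out = mixed_mm(a, b)
```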
Test Plan: python test/inductor/test_pattern_matcher.py -k "mixed_mm"
python test/inductor/test_torchinductor.py -k "mixed_mm"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106443
Approved by: https://github.com/jansel
Some notable changes:
1. `constrain_as_size` allows the min value to be less than 2, as it will unconditionally assume min >= 2 for compiler purposes. Instead, we add an additional check to make sure the max value is always greater than 2.
2. Previously, we used to runtime-assert on the unbacked symint's value range, which was always [2, max]. I modified this logic to assert on [0, max] unless the user explicitly specifies the min range.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106591
Approved by: https://github.com/gmagogsfm, https://github.com/ezyang
When removing an inplace buffer, we just mark it as ```REMOVED```. If, after removing some inplace buffers, we then mark a buffer as an inplace buffer and use the length of ```self.inplace_buffers.values()``` to create the buffer name, we may define an inplace buffer name that already exists in ```self.inplace_buffers.values()```:
Before removing some inplace buffers, ```self.inplace_buffers``` may look like:
```
{'buf0': InplacedBuffer(inner_name='in_out_ptr0', other_names=['buf0', 'buf2', 'buf4']), 'buf2': InplacedBuffer(inner_name='in_out_ptr0', other_names=['buf0', 'buf2', 'buf4']), 'buf4': InplacedBuffer(inner_name='in_out_ptr0', other_names=['buf0', 'buf2', 'buf4']), 'buf5': InplacedBuffer(inner_name='in_out_ptr1', other_names=['buf5', 'buf7', 'buf9']), 'buf7': InplacedBuffer(inner_name='in_out_ptr1', other_names=['buf5', 'buf7', 'buf9']), 'buf9': InplacedBuffer(inner_name='in_out_ptr1', other_names=['buf5', 'buf7', 'buf9']), 'buf12': InplacedBuffer(inner_name='in_out_ptr2', other_names=['buf12', 'buf13']), 'buf13': InplacedBuffer(inner_name='in_out_ptr2', other_names=['buf12', 'buf13']), 'buf17': InplacedBuffer(inner_name='in_out_ptr3', other_names=['buf17', 'buf19']), 'buf19': InplacedBuffer(inner_name='in_out_ptr3', other_names=['buf17', 'buf19']), 'buf21': InplacedBuffer(inner_name='in_out_ptr4', other_names=['buf21', 'buf25']), 'buf25': InplacedBuffer(inner_name='in_out_ptr4', other_names=['buf21', 'buf25']), 'buf20': InplacedBuffer(inner_name='in_out_ptr5', other_names=['buf20', 'buf26', 'buf31', 'buf32']), 'buf26': InplacedBuffer(inner_name='in_out_ptr5', other_names=['buf20', 'buf26', 'buf31', 'buf32']), 'buf31': InplacedBuffer(inner_name='in_out_ptr5', other_names=['buf20', 'buf26', 'buf31', 'buf32']), 'buf32': InplacedBuffer(inner_name='in_out_ptr5', other_names=['buf20', 'buf26', 'buf31', 'buf32'])}
```
After removing some inplace buffers, ```self.inplace_buffers``` may look like:
```
{'buf0': InplacedBuffer(inner_name='in_out_ptr0', other_names=['buf0', 'buf2', 'buf4']), 'buf2': InplacedBuffer(inner_name='in_out_ptr0', other_names=['buf0', 'buf2', 'buf4']), 'buf4': InplacedBuffer(inner_name='in_out_ptr0', other_names=['buf0', 'buf2', 'buf4']), 'buf5': 'REMOVED', 'buf7': 'REMOVED', 'buf9': 'REMOVED', 'buf12': 'REMOVED', 'buf13': 'REMOVED', 'buf17': InplacedBuffer(inner_name='in_out_ptr3', other_names=['buf17', 'buf19']), 'buf19': InplacedBuffer(inner_name='in_out_ptr3', other_names=['buf17', 'buf19']), 'buf21': 'REMOVED', 'buf25': 'REMOVED', 'buf20': 'REMOVED', 'buf26': 'REMOVED', 'buf31': 'REMOVED', 'buf32': 'REMOVED', 'buf16': InplacedBuffer(inner_name='in_out_ptr6', other_names=['buf16', 'buf38']), 'buf38': InplacedBuffer(inner_name='in_out_ptr6', other_names=['buf16', 'buf38'])}
```
If we then mark some buffer as an inplace buffer, the buffer name will use ```in_out_ptr{len(unique(self.inplace_buffers.values()))}```, which may be ```in_out_ptr6``` even though this name already exists in ```self.inplace_buffers```.
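A simplified illustration of the collision (not the actual Inductor data structures):
```python
from dataclasses import dataclass

@dataclass(frozen=True)
class InplacedBuffer:
    inner_name: str

# Distinct removed buffers all collapse to the same "REMOVED" string, so
# counting unique values undercounts the in_out_ptr names already handed out.
inplace_buffers = {
    "buf0": InplacedBuffer("in_out_ptr0"),
    "buf5": "REMOVED",   # was in_out_ptr1
    "buf12": "REMOVED",  # was in_out_ptr2
    "buf17": InplacedBuffer("in_out_ptr3"),
}
next_name = f"in_out_ptr{len(set(inplace_buffers.values()))}"
assert next_name == "in_out_ptr3"  # clashes with the existing in_out_ptr3
```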
After this PR, we change ```REMOVED``` to ```REMOVED{1, 2, 3, ...}```, which avoids defining a duplicate name. With this fix, ```pyhpc_equation_of_state``` from ```torchbench``` works for the CPU backend:
```python -m torch.backends.xeon.run_cpu --node_id 0 benchmarks/dynamo/torchbench.py --performance --inference --float32 -dcpu -n50 --inductor --freezing --no-skip --dashboard --only pyhpc_equation_of_state --cold_start_latency```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106852
Approved by: https://github.com/lezcano
On SPR machines, the mkldnn bfloat16 convolution always returns a channels-last output, and we convert it to channels-first if the input and weight are channels-first. There is an issue with this conversion when the output is nc11 (4*512*1*1): we always mark it as a public-format ideep tensor, and even though we call ```to_dense``` before returning the output, the output's stride is still a channels-last stride (512, 1, 512, 512). This PR calls ```resize_``` to make sure the stride is a contiguous stride.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106966
Approved by: https://github.com/mingfeima
RFC: https://github.com/pytorch/rfcs/pull/54
First commit is the contents of https://github.com/Quansight-Labs/numpy_pytorch_interop/
We have already been using this in core for the last few months as an external dependency. This PR pulls all of it into core.
In the next commits, I do a number of things in this order
- Fix a few small issues
- Make the tests that this PR adds pass
- Bend backwards until lintrunner passes
- Remove the optional dependency on `torch_np` and simply rely on the upstreamed code
- Fix a number of dynamo tests that were passing before (they were not testing anything, I think) and are not passing now.
Missing from this PR (but not blocking):
- Have a flag that deactivates tracing NumPy functions and simply breaks. There used to be one but after the merge stopped working and I removed it. @lezcano to investigate.
- https://github.com/pytorch/pytorch/pull/106431#issuecomment-1667079543. @voznesenskym to submit a fix after we merge.
All the tests in `tests/torch_np` take about 75s to run.
This was work by @ev-br, @rgommers, @honno, and me. I did not create this PR via ghstack (which would have been convenient) since this is a collaboration, and ghstack doesn't allow for shared contributions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106211
Approved by: https://github.com/ezyang
Summary:
Introduce a GPU memory Layout qualifier in `vTensor`, which will allow more efficient memory layouts when storing Tensors on the GPU.
The plan is for shaders to use the memory layout qualifier to convert between logical tensor coordinates and physical texel positions.
Test Plan:
As-is, this diff should be a no-op. Run standard tests to make sure everything works as expected.
```
buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1
buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1
```
Reviewed By: kimishpatel
Differential Revision: D48129905
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106978
Approved by: https://github.com/liuk22
Summary:
Redirect `aten._unsafe_index` to `aten.index` through a decomposition.
Also add it to the list of core decompositions.
Test Plan: contbuild and OSS CI (similar to D40075277)
Differential Revision: D48163393
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106814
Approved by: https://github.com/SherlockNoMad
Summary: Adding an enforce gives better error information than raising SIGFPE when division by zero happens. We'll get the actual BlobRef names as well as the error categories.
Test Plan:
Ran a local worker and client using DPP session with empty tensors and checked the error:
`../buck-out/v2/gen/fbcode/data_preproc/perf_test/client --sr2_event_base_pool_size=24`
`../buck-out/v2/gen/fbcode/data_preproc/perf_test/worker --dpp_session_id=5D49F56C98CC95BD97027BC0DDB38D8F`
```{dpp_internal_errorcategory : user_error,
ONCALL : MLDP_CONTROL,
CATEGORY : INPUT_ERROR,
errorsubsystemtags : [DPP_WORKER],
errorcause : USER_ERROR,
RETRYABILITY : 0}F0806 17:47:52.607200 2280375 SchedRuntimeEnv.cpp:385] facebook::data_preproc::NonRetryableGenericUser
Error: User preprocessing error c10::Error: [enforce fail at utility_ops.h:730] input.numel() > 0. 0 vs 0. tensor has t
o be nonempty (Error from operator:
input: "preproc_data_pipeline/preproc/features/default_feature_preproc/normalization/dper_feature_normalization/sparse_
features_processor_1/sparse_feature_transform/F3_ADFINDER_USER_ADS_COFFEE_LSF_FLEXIBLE_BATCH_USER_FB_UIP_FEATURE_IDSCOR
ELIST_ENCODED_FB_UIP_TOP100_IDSCORELIST_ENCODED_1/sequential_1019/id_score_list_quantization_decode_1/Concat:0" input:
"preproc_data_pipeline/preproc/features/default_feature_preproc/normalization/dper_feature_normalization/sparse_feature
s_processor_1/sparse_feature_transform/F3_ADFINDER_USER_ADS_COFFEE_LSF_FLEXIBLE_BATCH_USER_FB_UIP_FEATURE_IDSCORELIST_E
NCODED_FB_UIP_TOP100_IDSCORELIST_ENCODED_1/sequential_1019/id_score_list_quantization_decode_1/Mul_2" input: "preproc_d
ata_pipeline/preproc/features/default_feature_preproc/normalization/dper_feature_normalization/sparse_features_processo
r_1/sparse_feature_transform/F3_ADFINDER_USER_ADS_COFFEE_LSF_FLEXIBLE_BATCH_USER_FB_UIP_FEATURE_IDSCORELIST_ENCODED_FB_UIP_TOP100_IDSCORELIST_ENCODED_1/sequential_1019/id_score_list_quantization_decode_1/encoded_id_lengths" output: "preproc_data_pipeline/preproc/features/default_feature_preproc/normalization/dper_feature_normalization/sparse_features_processor_1/sparse_feature_transform/F3_ADFINDER_USER_ADS_COFFEE_LSF_FLEXIBLE_BATCH_USER_FB_UIP_FEATURE_IDSCORELIST_ENCODED_FB_UIP_TOP100_IDSCORELIST```
Differential Revision: D48104430
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106882
Approved by: https://github.com/kit1980
Currently, multilayer reductions (aka split reductions) are only used with static
shapes, which results in worse performance and accuracy when dynamic shapes are
enabled. Instead, this change only requires that the shape has a hint value.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106747
Approved by: https://github.com/lezcano
ghstack dependencies: #106626, #106870
`JITFunction._key_of` uses the value of the argument to distinguish between
i32 and i64, but this fails if the value is used in indexing calculations where
the value exceeds `INT_MAX`.
Instead, we should use `index_dtype` which means all indexing calculations are
performed in the same dtype.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106870
Approved by: https://github.com/lezcano
ghstack dependencies: #106626
When `reference_as_float` is true, reference gradients will not have the same
dtype as the actual computed gradients. This fixes the issue by downcasting
before doing the comparison.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106626
Approved by: https://github.com/lezcano
This fixes a pretty vicious bug relating to `SHARD_GRAD_OP`, mixed precision, EMA, and eval.
**Bug Explanation**
The model has a main module and an EMA module, where the main module is used for training and the EMA module is used for eval. The model has FSDP's fp16 mixed precision enabled. The flow consists of (1) training forward/backward/optimizer -> (2) EMA update (copy main module to EMA module) -> (3) eval forward in `torch.no_grad()`, where this repeats for many iterations.
Consider the _second_ iteration.
- From the first iteration's eval forward, the EMA module has the fp16 unsharded parameters in memory (not freed due to `SHARD_GRAD_OP`).
- In this second iteration's step (2), we perform the EMA update under the `summon_full_params()` context, where FSDP specially forces full precision. This means that the EMA module now uses fp32 unsharded parameters, distinct from the fp16 unsharded parameters still in memory. The EMA update modifies those fp32 parameters, and upon exiting the context, FSDP correctly writes the modifications back to the fp32 sharded parameters.
- In the second iteration's step (3) (eval forward), FSDP checks whether it needs to run the unshard op (including all-gather) but sees it does not since the fp16 unsharded parameters are still in memory. Thus, FSDP uses those fp16 unsharded parameters directly without all-gather. However, these fp16 unsharded parameters are stale and do not include the EMA update!
- In other words, at this point, the fp32 sharded parameters are correct, the fp16 unsharded parameters are stale, and FSDP chooses _not_ to re-all-gather since the fp16 unsharded parameters are in memory.
**Fix Explanation**
This PR fixes this by freeing the fp16 unsharded parameters if they are still allocated when forcing full precision, i.e. using fp32 unsharded parameters in `summon_full_params()`. This ensures that any modifications written back to the fp32 sharded parameters will be persisted via the next all-gather.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106858
Approved by: https://github.com/kumpera
ghstack dependencies: #106857
Summary:
att
we don't actually need the gradient for conv2d, we just need it to run without error, so we delayed the out_dtype gradient error to the time when the user actually requests it
Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_representation_conv2d
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106924
Approved by: https://github.com/zou3519, https://github.com/kimishpatel
Fixes `signed-unsigned comparison` warnings introduced by https://github.com/pytorch/pytorch/pull/106809 (previously by <s> https://github.com/pytorch/pytorch/pull/104054 </s> ) that changed type of `num_indices` to unsigned.
Before the change warnings looks as follows:
```
/tmp/tmpxft_00194ca7_00000000-6_IndexKernel.cudafe1.stub.c:31:580: required from here
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:58:63: warning: comparison of integer expressions of different signedness: ‘const long unsigned int’ and ‘int’ [-Wsign-compare]
58 | AT_ASSERT(num_indices == iter.ntensors() - 2);
| ^
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:74:19: warning: comparison of integer expressions of different signedness: ‘int’ and ‘const long unsigned int’ [-Wsign-compare]
74 | for (int i = 0; i < num_indices; i++) {
| ~~^~~~~~~~~~~~~
```
TODO: Turn those warning into errors
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104423
Approved by: https://github.com/Skylion007
1. add a python meta registration, to fix an issue with the forward pass. The problem was that previously, the C++ meta registration calls [numel()](7b14a14e27/aten/src/ATen/native/TensorAdvancedIndexing.cpp (L329)) which fails (LMK if it's better to fix the C++ implementation to not do this check)
2. Modify the backward to fix an issue in the backward. The backward is not a custom op - it's a custom manual backward implementation. In particular, there's some situations that don't support double backward; the check for whether double backward is allowed requires a .item() call. To fix the meta/fake tensor case, this PR will avoid setting the double backward error only if `GradMode::is_enabled()` - which shouldn't be turned on in PT2.
3. Update skips.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106429
Approved by: https://github.com/zou3519
Forward fixes https://github.com/pytorch/pytorch/pull/106615 by increasing tolerance in the test.
The capturable implementation for foreach simply varies due to a different order of operations when updating params. I had also attempted to compare against fp64 but that introduced more disparity in the other optimizer configs. It is worth trying the fp64 comparison at a later point, but let's get the test passing first.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106887
Approved by: https://github.com/izaitsevfb
Batchnorm inference is done in fp32 if the inputs are in fp16/bf16, and the output is cast back down to its original precision. This causes the batchnorm weights to get constant-folded to fp32, which prevented Conv-BN folding from firing.
```
def forward(self, arg0_1: bf16[32, 3, 3, 3], arg1_1: bf16[32], arg2_1: bf16[32], ...)
convolution: bf16[3, 32, 15, 15] = aten..convolution.default(arg6_1, arg0_1, None, [2, 2], [0, 0], [1, 1], False, [0, 0], 1); arg6_1 = arg0_1 = None
# weight upcasting
convert_element_type: f32[32] = torch.ops.prims.convert_element_type.default(arg3_1, torch.float32); arg3_1 = None
convert_element_type_1: f32[32] = torch.ops.prims.convert_element_type.default(arg4_1, torch.float32); arg4_1 = None
...
# end of batch norm
add_1: f32[3, 32, 15, 15] = aten..add.Tensor(mul_2, unsqueeze_7); mul_2 = unsqueeze_7 = None
# output downcast
convert_element_type_2: bf16[3, 32, 15, 15] = torch.ops.prims.convert_element_type.default(add_1, torch.bfloat16); add_1 = None
```
I mark the convolutions that are followed by binary foldable ops in a higher precision which then get converted back down to the original conv dtype. We fold the weights in fp32 because it gives slightly better accuracy, then at the end of the pass convert the weights back to their original dtype.
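A hedged repro sketch of the pattern (shapes are illustrative, and Inductor freezing/constant folding is assumed to be enabled for the Conv-BN folding to apply):
```python
import torch

class ConvBN(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 32, 3, stride=2, bias=False)
        self.bn = torch.nn.BatchNorm2d(32)

    def forward(self, x):
        return self.bn(self.conv(x))

model = ConvBN().to(torch.bfloat16).cuda().eval()
x = torch.randn(3, 3, 32, 32, dtype=torch.bfloat16, device="cuda")
with torch.no_grad():
    # The bf16 conv output is upcast to fp32 for batchnorm and cast back to
    # bf16, which is the pattern that previously blocked Conv-BN folding.
    out = torch.compile(model)(x)
```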
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106576
Approved by: https://github.com/XiaobingSuper, https://github.com/yanboliang
Fixes #106057, except **Attribute dtype mismatch. E.g., alpha of aten.add.Tensor. -> Attribute: alpha INT vs FLOAT**.
Summary of the changes:
* Fill in defaults for attributes when `param_schema` is applied. This relaxes the matching on default attributes.
* Fill in None for optional inputs when `param_schema` is applied.
* Keep extra kwargs in attributes to keep matching strict.
* Allow an input to be None when its dtype is `optional[INPUT]`.
The change comes with the guarantee from torchlib that attribute would never be None. For example, if `memory_format` is needed. The function should specify like this:
```python
@torch_op("aten::clone")
def aten_clone(
self: TTensor, memory_format: str = "" # pylint: disable=unused-argument
) -> TTensor:
"""clone(Tensor self, *, MemoryFormat? memory_format=None) -> Tensor"""
return op.Identity(self)
```
Prior to this PR, OpSchema matching didn't strictly guard the number of inputs/attributes when looking for the nearest match, which introduced a bug of dispatching `aten::div.Tensor` to `aten::div.default`, disregarding the fact that `aten::div.Tensor` has an extra attribute `rounding_mode`. This PR fixes the issue with new perfect/nearest-match logic. In particular, it strictly restricts what qualifies as a nearest-match candidate.
For each ONNX variant, we check these steps one by one:
1. Check if the number of inputs in the function signature is the same as the provided inputs.
2. Check if the attribute names in the function signature are the same set as the provided attributes.
If either of the above two criteria is not met, the ONNX variant is neither a perfect match nor a nearest-match candidate (match_score=None).
3. Check if the input dtypes match.
4. Check if the attribute dtypes match.
If 3 and 4 are met, this is a perfect match; otherwise, it's still considered a nearest-match candidate with a matching score.
## Case Study
### Optional Input
The dispatcher recognizes optional inputs. However, the input can't be ignored. None must be provided.
```python
# Perfect match is found
inputs = (Tensor([2, 3]), None)
def aten_op(X: TTensor, Y: Optional[INT64]):
    ...
```
Real Case: aten::convolution
NOTE: There is not, and will not be, an optional attribute in torchlib.
### Different attributes
If an attribute is provided with value, it's a must to match the attribute in function signature.
```python
# Not perfect match, nor nearest match
inputs = (Tensor([2, 3]),)
attributes = {"a":1, "b":2}
def aten_op(X: TTensor, a: int):
    ...
```
Real Case: aten::div and aten::div.Tensor_mode
### Default attribute
Default attribute will fill in the value into inputs/attributes
```python
# Perfect match is found
inputs = (Tensor([2, 3]),)
attributes = {}
def aten_op(X: TTensor, a: int = 3):
    ...
```
Real case: aten::clone
### Ignore attribute with None value
The attributes with None value will be ignored in matching.
```python
# Perfect match
inputs = (Tensor([2, 3]),)
attributes = {"a": None}
def aten_op(X: TTensor):
    ...

# Not perfect match, but eligible for nearest match
inputs = (Tensor([2, 3]),)
attributes = {"a": None}
def aten_op(X: TTensor, a: int = 3):
    ...
```
Real case: aten::div and aten::div.Tensor_mode
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106478
Approved by: https://github.com/thiagocrepaldi, https://github.com/BowenBao
Fixes https://github.com/pytorch/pytorch/issues/106754
This PR:
- moves test/autograd/test_fallback.py to test_autograd_fallback.py and
removes it from test_autograd.py (necessary for the next step)
- adds test_autograd_fallback.py to parallel test blocklist.
- lintrunner really wanted to make changes to the files, but other than
that, it is a move.
The problem is that we set a global option (the autograd fallback mode)
during these tests which may cause the tests to interfere with each
other.
Test Plan:
- python test/run_test.py -i test_autograd_fallback
NOTE to diff train oncall:
- You'll also need to modify the test/autograd/test_fallback.py TARGET in
caffe2/test/TARGETS since we renamed the file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106866
Approved by: https://github.com/soulitzer
Summary:
We are seeing that the `aten._native_multi_head_attention` op (not in the core ATen op set) is left in the exported graph and causes problems downstream at runtime.
Two proposed solutions:
1. Disable fast path while tracing to leverage the non-optimized path to get decomp, that way, the blamed op won't show up in the exported graph
2. Add a decomp rule for `aten._native_multi_head_attention`
After discussing with kimishpatel and bdhirsh, option 1 is preferred, and we verified it could immediately unblock the critical model enablement work for PP.
Test Plan: CI
Differential Revision: D48169806
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106824
Approved by: https://github.com/kimishpatel
Summary:
1. Add bool to quantized flow
2. Add support for cases where channel is *not* a multiple of 4 to the shader `image_to_nchw_quantized_mul4.glsl`. Note that the `mul4` in the shader name refers to height * width % 4 == 0.
Add test cases.
See: D48082479
Test Plan:
New tests:
```
lfq@lfq-mbp fbsource % buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 -- --gtest_filter="*copy_to_texture_bool*"
Downloaded 1/3 artifacts, 1.74 Mbytes, 50.0% cache miss (for updated rules)
Building: finished in 14.4 sec (100%) 474/474 jobs, 3/474 updated
Total time: 14.4 sec
BUILD SUCCEEDED
Running main() from xplat/third-party/gmock/googletest-1.12.1/googletest/src/gtest_main.cc
Note: Google Test filter = *copy_to_texture_bool*
[==========] Running 3 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 3 tests from VulkanAPITest
[ RUN ] VulkanAPITest.copy_to_texture_bool_mul4_hw
VUID-VkDeviceCreateInfo-pProperties-04451(ERROR / SPEC): msgNum: 976972960 - Validation Error: [ VUID-VkDeviceCreateInfo-pProperties-04451 ] Object 0: handle = 0x10bf61020, type = VK_OBJECT_TYPE_PHYSICAL_DEVICE; | MessageID = 0x3a3b6ca0 | vkCreateDevice: VK_KHR_portability_subset must be enabled because physical device VkPhysicalDevice 0x10bf61020[] supports it The Vulkan spec states: If the [VK_KHR_portability_subset] extension is included in pProperties of vkEnumerateDeviceExtensionProperties, ppEnabledExtensions must include "VK_KHR_portability_subset". (https://vulkan.lunarg.com/doc/view/1.2.182.0/mac/1.2-extensions/vkspec.html#VUID-VkDeviceCreateInfo-pProperties-04451)
Objects: 1
[0] 0x10bf61020, type: 2, name: NULL
[ OK ] VulkanAPITest.copy_to_texture_bool_mul4_hw (114 ms)
[ RUN ] VulkanAPITest.copy_to_texture_bool_mul4_chw
[ OK ] VulkanAPITest.copy_to_texture_bool_mul4_chw (4 ms)
[ RUN ] VulkanAPITest.copy_to_texture_bool
[ OK ] VulkanAPITest.copy_to_texture_bool (7 ms)
[----------] 3 tests from VulkanAPITest (126 ms total)
[----------] Global test environment tear-down
[==========] 3 tests from 1 test suite ran. (127 ms total)
[ PASSED ] 3 tests.
```
All tests:
```
[ SKIPPED ] VulkanAPITest.querypool_flushed_shader_log (0 ms)
[----------] 331 tests from VulkanAPITest (7327 ms total)
[----------] Global test environment tear-down
[==========] 331 tests from 1 test suite ran. (7327 ms total)
[ PASSED ] 330 tests.
[ SKIPPED ] 1 test, listed below:
[ SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
```
Quantized tests:
```
[----------] 63 tests from VulkanAPITest (2009 ms total)
[----------] Global test environment tear-down
[==========] 63 tests from 1 test suite ran. (2009 ms total)
[ PASSED ] 63 tests.
YOU HAVE 8 DISABLED TESTS
```
Differential Revision: D48086455
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106752
Approved by: https://github.com/SS-JIA
issue resolved: https://github.com/pytorch/pytorch/issues/97791
Before this PR, mixed_precision applied to buffers from ignored modules; see ```test_state_dict_with_ignored_modules(mixed_precision=True)``` for a repro.
After this PR, we avoid applying mixed_precision semantics to buffers from ignored modules:
* step 1 initialization: state._ignored_buffer_names contains all the buffers from ignored modules
* step 2 lazy init at runtime: skip ignored buffers in ```_get_buffers_and_dtypes_for_computation```
* step 3 skip upcasting in state_dict hook: avoid upcasting for ignored buffers in ```_get_buffers_and_dtypes_for_computation```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106766
Approved by: https://github.com/awgu
### <samp>🤖 Generated by Copilot at 3c5a179</samp>
Update `RELEASE.md` with compatibility information for PyTorch 2.1. This file documents the supported versions of Python, CUDA, and CUDNN for each PyTorch release.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106891
Approved by: https://github.com/kit1980
Companion with https://github.com/pytorch/test-infra/pull/4424
Uses the file rating generated by the test infra PR to reorder tests. For each test file, sum the file ratings from the changed files in the PR, and put the tests in order of that sum.
A lot of tests are probably going to end up as "prioritized" since it takes anything with a rating > 0 right now.
Sharding is done twice, once on the prioritized tests, and once on the general/non prioritized tests. Prioritized tests have an order, so they should be sharded according to that order, while general tests don't have an order and are sharded by test time, which should result in more balanced shards.
I'll change the metric name before I merge; I want to quarantine my testing stuff from actual results.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106347
Approved by: https://github.com/ZainRizvi
Summary: The internal model and ResNet use the "re-export" flow now. Also did some refactoring to make the code a little cleaner.
Some changes for OSS:
1. Correctly use the "cached" fake tensors so that static symbols are still resolved to static
2. Change logic in PassBase to allocate static shapes for parameters
3. Add "is_torch_exported" tag to every node to make it survive during various graph transformations.
4. Added experimental wrapper API for quantization team to get pre_dispatch=True graph. Note that it doesn't actually do that right now. But we plan to switch soon.
Test Plan: CI
Differential Revision: D47890878
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106676
Approved by: https://github.com/jerryzh168
Summary:
This allows these annotate functions to be shared by other quantizers so that writing a new quantizer is easier.
Note that these annotation functions will be maintained by XNNPACKQuantizer developers instead of the AO team.
Test Plan:
python test/test_quantization.py TestQuantizePT2E
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106642
Approved by: https://github.com/andrewor14
Similar to the issue in #97894, dropout is dispatched to a fused kernel (native_dropout) only on some devices like cuda. It is hard to optimize performance when using AOT with a custom device, as dropout ends up decomposed to bernoulli and mul. This PR changes this behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106774
Approved by: https://github.com/ezyang
inductor_prims._bucketize was added while we worked on hardening the inductor lowering. Now the lowering should be sufficiently tested and should have good enough perf (https://github.com/pytorch/pytorch/pull/104456) - so we can remove the temporary `inductor_prims._bucketize` op and move the lowerings to the `aten.bucketize` op.
Note that we haven't added a CPU implementation yet.
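A minimal example that now hits the Inductor lowering directly (illustrative data; CUDA only, given the missing CPU lowering):
```python
import torch

@torch.compile
def bucketize(values, boundaries):
    return torch.bucketize(values, boundaries)

boundaries = torch.tensor([0.25, 0.5, 0.75], device="cuda")
values = torch.rand(1024, device="cuda")
buckets = bucketize(values, boundaries)  # int64 bucket indices in [0, 3]
```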
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106658
Approved by: https://github.com/eellison
We want to add xpu support for foreach kernels, so we add the "xpu" device to the support list.
In addition, for fused kernels in Adam and AdamW, the device check is driven by the support list in adam.py (lines 44-46) and adamw.py (lines 60-64), so we remove the redundant check for cuda devices, as it would block the other devices in the support list.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106021
Approved by: https://github.com/janeyx99
If record_history is enabled, then a block is allocated, record_history
is disabled, and then the block is freed and later unmapped, we can hit
the `to_map->context_when_allocated == nullptr` assertion.
This change universally clears context_when_allocated on free, which should
prevent this sequence of events from happening.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106818
Approved by: https://github.com/eellison
Summary:
As title.
There's a corner case where both cpu and gpu are available: although the model is moved to cpu, the newly created PTQ weight observer is still on gpu. Therefore, during convert, this line will fail: https://fburl.com/4rhipfvb
Test Plan: CI
Differential Revision: D48141494
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106755
Approved by: https://github.com/jerryzh168
Currently, DCP treats tensors as duplicates and only saves them on rank0. This won't work for PiPPy as PiPPy does have unique tensors across different ranks. With the current setup, we would only be saving the tensors on rank0 (coordinator rank).
In this PR, we change to letting each rank create its own WriteItem for tensors. For the ones that are replicated across different ranks, we handle this through dedup_tensors(), which dedups the replicated WriteItems so we only do the actual writing once.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106415
Approved by: https://github.com/wz337
This PR adds a **same_signature** flag to dynamo.export.
**Motivation:**
In https://github.com/pytorch/pytorch/pull/105679, we experimented on **using dynamo to inspect the UDFs** for cond in eager mode (without torch.compile). This helps us to normalize the inputs (e.g. lifting closure to inputs) and makes higher order operator more robust (e.g. forbid python side effects) and less error-prone in general.
We decided to use dynamo.export (instead of torch.compile) to do the inspection (pointed out by @voznesenskym @zou3519):
- We'd like a **whole-graph capture** for the UDF.
- We'd like the dynamo inspection to be **stateless**. Using torch.compile would require resetting dynamo context before and after the inspection because the compile flags may be different from users' torch.compile. This will clear all dynamo cache.
- We can still implement some **caching** based on the guards.
However, this requires export to be able to handle the case where it cannot always rewrite signature: e.g. closure lifted as input.
This PR makes the rewrite optional.
**Implementation:**
We just put all the code that are related to signature rewriting into a function called rewrite_signature and use a same_signature flag to optionally to the transformation.
**Test Plan:**
existing tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106569
Approved by: https://github.com/ezyang
This PR extends impl_abstract to work with existing
torch.library/TORCH_LIBRARY ops.
There's a question of what to do if the user calls impl_abstract
and the op already has a registration for:
- DispatchKey::Meta. We raise.
- DispatchKey::CompositeImplicitAutograd. We raise.
- DispatchKey::CompositeExplicitAutograd. To be pragmatic, we don't
raise, since the user's CompositeExplicitAutograd might work for all
other backends but Meta.
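A hedged sketch of the intended usage with a pre-existing `torch.library` op (the op below is made up, and the `torch._custom_ops.impl_abstract` entry point is assumed from this stack):
```python
import torch
import torch._custom_ops
from torch.library import Library

lib = Library("mylib", "DEF")
lib.define("my_sin(Tensor x) -> Tensor")
lib.impl("my_sin", torch.sin, "CPU")

@torch._custom_ops.impl_abstract("mylib::my_sin")
def my_sin_abstract(x):
    # Runs under meta/fake tensors: only the output metadata is described.
    return torch.empty_like(x)
```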
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106088
Approved by: https://github.com/soulitzer
ghstack dependencies: #106075, #106076
The design is that we construct a CustomOp object around the existing
operator and then use it to register things. It is totally OK if the
operator isn't functional (unlike torch._custom_ops.custom_op that can
only construct functional operators).
If the operator already has an implementation from a backend (either via
direct registration to e.g. DispatchKey::CPU, or an indirect
registration like CompositeImplicitAutograd/CompositeExplicitAutograd),
we raise an error.
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106076
Approved by: https://github.com/soulitzer
ghstack dependencies: #106075
These are valid with the torch.library API, but (1) they add complexity
and (2) I have never seen a custom op actually use an overload name
before. For simplicity we block all overloads.
Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106075
Approved by: https://github.com/soulitzer
This PR contains two new private ops, added for cuSPARSELt support.
These ops call into the cuSPASRELt kernels using the bindings they
provide. For more information, see the documentation
[here](https://docs.nvidia.com/cuda/cusparselt/index.html).
The two new private ops added are:
```
_cslt_compress()
_cslt_sparse_mm()
```
_cslt_compress is an op that returns the compressed matrix given a sparse matrix that is passed in.
_cslt_sparse_mm is an op that expects a compressed matrix (the result of _cslt_compress) and a dense matrix, and performs a sparse-dense matmul.
These ops will throw runtime errors if cuSPARSELt is not present.
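A hedged usage sketch (assumes a cuSPARSELt-enabled build and a 2:4 structured-sparse operand; shapes and the two-argument call are illustrative):
```python
import torch

# Build a 2:4 structured-sparse fp16 matrix (two zeros in every group of four).
A = torch.randn(128, 128, device="cuda", dtype=torch.float16)
A = A.reshape(-1, 4)
A[:, :2] = 0
A = A.reshape(128, 128)

B = torch.randn(128, 64, device="cuda", dtype=torch.float16)

compressed_A = torch._cslt_compress(A)        # compressed representation of A
out = torch._cslt_sparse_mm(compressed_A, B)  # sparse (compressed) x dense matmul
```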
This PR also modifies the test and tensor subclass to reflect the new
cuSPARSELt support.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102133
Approved by: https://github.com/cpuhrsch
An alternative to #106235 that just adds our own uid generation so that we can call `beginAllocateStreamToPool` (which notifies the caching allocator that a capture is starting) before actually starting the capture. Note that this does appear to change the behavior of uid generation a bit compared to the CUDA API call (which seems to increment by 3 each time instead of 1).
Looking at the changes again, I'm not sure whether the _begin_ capture ordering change is needed in addition to the _end_ capture ordering change, but it makes me uneasy as I'm not sure anything prevents the autograd thread from running cleanup code "in-between" captures.
CC @zdevito @eellison
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106570
Approved by: https://github.com/zdevito
This PR adds a new configuration that enables shapes of torch.nn.Parameter to be treated as dynamic in order to avoid extensive recompilation when Parameters are used instead of Tensors.
This feature addresses part of issue #105279.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105855
Approved by: https://github.com/ezyang
- Implement `MPSEventPool` to recycle events.
- Implement Python bindings with the `torch.mps.Event` class using the MPSEventPool backend. The current member functions of the Event class are `record()`, `wait()`, `synchronize()`, `query()`, and `elapsed_time()` (see the usage sketch after this list).
- Add API to measure elapsed time between two event recordings.
- Added documentation for Event class to `mps.rst`.
- Added test case to `test_mps.py`.
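A hedged usage sketch of the new bindings (assumes an MPS build; the `enable_timing` flag and millisecond units are assumptions mirroring the CUDA Event API):
```python
import torch

start = torch.mps.Event(enable_timing=True)
end = torch.mps.Event(enable_timing=True)

x = torch.randn(1024, 1024, device="mps")
start.record()
y = x @ x
end.record()
end.synchronize()
print(start.elapsed_time(end))  # elapsed time between the two recordings (ms, assumed)
```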
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102121
Approved by: https://github.com/albanD, https://github.com/kulinseth
Currently `stashed_for_allocator_safety_` is uninitialized in this path, which will crash if another operation assumes a non-nullptr (the case when `TORCH_NCCL_AVOID_RECORD_STREAMS=1` and `avoidRecordStreams_` is set).
CC @kwen2501 @ptrblck
@kwen2501
I'm not familiar with what happens to the coalesced work when `endCoalescing` is called. In theory, if the coalesced work has already "stashed for allocator safety," can we also avoid the record-streams calls here? Or is the coalesced work discarded (and its `_stashed_for_allocator_safety` vectors also destroyed)?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106166
Approved by: https://github.com/kwen2501
Summary:
Before, we copied a meta merge and used it as a skeleton to do d2d merge replication. However, some models like prospector have a CPU op LongIndex which takes quite a long time to load. That makes the meta merge copy expensive.
Modify jit::Module::deepcopy() to allow device copy. This simplifies user code and removes all unnecessary copies like tempfile and meta merge.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106521
Approved by: https://github.com/davidberard98
This test raises an error inside the test when an xfailed test succeeds, but
is also decorated with the xfail decorator which converts the error to an xfail.
Instead, this lets the test function pass normally and lets the xfail decorator
raise "Unexpected success".
I also updated the COLLECT_EXPECT code and ran it to get the updated set of
failures.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106631
Approved by: https://github.com/lezcano
ghstack dependencies: #106319, #106400
Currently there are FFT operators which raise `UnsupportedOperatorException`
because their meta implementations sometimes give incorrect strides. This works
around the problem for static shapes by falling back to eager. Though we still
don't support calls with dynamic shapes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106319
Approved by: https://github.com/ezyang
### <samp>🤖 Generated by Copilot at ac9bd0c</samp>
> _We're sailing on the CUDA sea, with tensors and graphs aplenty_
> _We're refactoring the code, to make it clear and neat_
> _We're using nested namespaces, like `at::cuda::blas`_
> _So heave away, me hearties, heave away on the count of three_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106708
Approved by: https://github.com/kit1980, https://github.com/Skylion007
This PR:
- adds a capturable API for NAdam similar to Adam(W) (a usage sketch follows after this list)
- adds tests accordingly
- discovered and fixed bugs in the differentiable implementation (now tested through the capturable codepath).
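A minimal usage sketch of the new flag, assuming the same `capturable` semantics as Adam(W):
```python
import torch

model = torch.nn.Linear(8, 8, device="cuda")
opt = torch.optim.NAdam(model.parameters(), lr=1e-3, capturable=True)

loss = model(torch.randn(4, 8, device="cuda")).sum()
loss.backward()
opt.step()  # all state stays on-device, so the step can run inside a CUDA graph
```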
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106615
Approved by: https://github.com/albanD
This is part of the effort to enable missing cpp tests on the ROCm platform.
In this change, we enabled the test_aten cpp test.
The total number of tests enabled is 214.
**Test plan:**
Tested in the rocm/pytorch-nightly:latest
```
jenkins@xxxxx:/tmp/pytorch$ .ci/pytorch/test.sh &> test_aten.out
jenkins@xxxxx:/tmp/pytorch$ grep PASS test_aten.out |wc -l
214
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106476
Approved by: https://github.com/jithunnair-amd, https://github.com/malfet
Follow-up: #101173
This PR fixes the bug presented in #101173 by creating a special case for `sympy.Rational`
divisors, inside `FloorDiv` evaluation. In summary:
```python
FloorDiv(a, Rational(1, b))  ->  a * b
```
Besides that, this PR also does 2 other things:
- Replaces the use of the old `sympy.Mod` by the internal `Mod` (there were a few places
that were still looking for the SymPy one)
- Introduces debugging logs to the translation validator. These can be seen by setting the
environment variable: `TORCH_LOGS=+torch.fx.experimental.validator`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106644
Approved by: https://github.com/ezyang
ghstack dependencies: #106643
This PR makes Z3 expressions easier to read and understand by creating a custom printer
for them.
Z3 expressions can be printed in 2 forms:
1. Using the builtin `str(e)` function
2. Using the `e.sexpr()` method
The problem is that (1) is a bit hard to read because its line breaks are not so
intuitive. (2) is a bit nicer, but the `to_int` and `to_real` functions clutter things up.
The custom printer is an improved `sexpr()` function:
- Leaves everything in one line
- Gets rid of `to_int` and `to_real` functions
- Reconstruct the floor division operations
- Merge commutative operation chains
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106643
Approved by: https://github.com/ezyang
Summary: Removing this broken test as we are not going to land the fix for 2D regression. Instead, we are going to migrate to use device_mesh and dtensor state_dict for 2D.
Differential Revision: D48082586
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106640
Approved by: https://github.com/fduwjj
Batchnorm inference is done in fp32 if the inputs are in fp16/bf16, and the output is cast back down to its original precision. This causes the batchnorm weights to get constant-folded to fp32, which prevented Conv-BN folding from firing.
```
def forward(self, arg0_1: bf16[32, 3, 3, 3], arg1_1: bf16[32], arg2_1: bf16[32], ...)
convolution: bf16[3, 32, 15, 15] = aten..convolution.default(arg6_1, arg0_1, None, [2, 2], [0, 0], [1, 1], False, [0, 0], 1); arg6_1 = arg0_1 = None
# weight upcasting
convert_element_type: f32[32] = torch.ops.prims.convert_element_type.default(arg3_1, torch.float32); arg3_1 = None
convert_element_type_1: f32[32] = torch.ops.prims.convert_element_type.default(arg4_1, torch.float32); arg4_1 = None
...
# end of batch norm
add_1: f32[3, 32, 15, 15] = aten..add.Tensor(mul_2, unsqueeze_7); mul_2 = unsqueeze_7 = None
# output downcast
convert_element_type_2: bf16[3, 32, 15, 15] = torch.ops.prims.convert_element_type.default(add_1, torch.bfloat16); add_1 = None
```
I mark the convolutions that are followed by binary foldable ops in a higher precision which then get converted back down to the original conv dtype. We fold the weights in fp32 because it gives slightly better accuracy, then at the end of the pass convert the weights back to their original dtype.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106576
Approved by: https://github.com/XiaobingSuper, https://github.com/yanboliang
ghstack dependencies: #106471, #106575
This PR:
- Changes the AOTAutograd tests to also check that the output of the
forward is equal under AOTAutograd and eager-mode PyTorch.
- Adds a "check_gradients" flag to `check_aot_autograd`.
- If True, then we attempt to compute gradients and check them.
- If False, then we we just check the outputs are equal
- If "auto", then we will compute gradients and check them only if
some input and some output requires grad. This option is useful for
crossref tests where we don't necessarily have inputs that require
grad.
1) I need a testing utility to test "AOTAutograd for inference",
e.g. make_fx + functionalize.
2) I want to run aot_autograd_check in crossref tests for other test
suites (e.g. fbgemm) and not all inputs require grad.
Test Plan:
- existing tests
- new tests to test the degenerate cases
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106558
Approved by: https://github.com/ezyang, https://github.com/soulitzer
This simple PR can let me know how much more fusion the loop ordering PR can bring compared to the baseline. I need this separate PR since I need to include it in both the baseline and test runs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106653
Approved by: https://github.com/eellison
This PR adds the ability to check whether the resulting ONNX graph has dynamic shapes when dynamic shapes are enabled.
Only test/onnx/test_fx_to_onnx.py and test/onnx/test_fx_op_consistency.py were covered because test/onnx/test_fx_to_onnx.py does not use any common "run_test" helper to wrap the `dynamo_export` call. Maybe that could be a refactor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106495
Approved by: https://github.com/BowenBao
Summary:
https://pytorch.org/docs/stable/generated/torch.flip.html
Implement flip for vulkan.
For batch and channel cases:
- Calculate the logical tensor values of N and C from pos.xyz
- Flip the logical tensor value of N, C or both
- Use `n*[C/4] + i/4, i%4` to get the new tensor value
Test Plan:
New tests:
```
lfq@lfq-mbp fbsource % buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 -- --gtest_filter="*flip*"
Recommended: Free up disk space to speed up builds.
Only 17GB is available on disk. Buck is slow when free disk space is under
50GB.
Consider running this command (from your home directory) to reclaim purgeable
space:
sudo /System/Library/Filesystems/apfs.fs/Contents/Resources/apfs.util -P *
Downloaded 0/53 artifacts, 0.00 bytes, 100.0% cache miss (for updated rules)
Building: finished in 35.3 sec (100%) 536/536 jobs, 6/536 updated
Total time: 35.3 sec
BUILD SUCCEEDED
Running main() from xplat/third-party/gmock/googletest-1.12.1/googletest/src/gtest_main.cc
Note: Google Test filter = *flip*
[==========] Running 4 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 4 tests from VulkanAPITest
[ RUN ] VulkanAPITest.flip_1d
[ OK ] VulkanAPITest.flip_1d (117 ms)
[ RUN ] VulkanAPITest.flip_2d
[ OK ] VulkanAPITest.flip_2d (1 ms)
[ RUN ] VulkanAPITest.flip_3d
[ OK ] VulkanAPITest.flip_3d (2 ms)
[ RUN ] VulkanAPITest.flip_4d
[ OK ] VulkanAPITest.flip_4d (10 ms)
[----------] 4 tests from VulkanAPITest (132 ms total)
[----------] Global test environment tear-down
[==========] 4 tests from 1 test suite ran. (132 ms total)
[ PASSED ] 4 tests.
lfq@lfq-mbp fbsource %
```
clang-format on `Flip.cpp` and `flip.glsl`
Reviewed By: SS-JIA
Differential Revision: D47921025
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106628
Approved by: https://github.com/SS-JIA
Fixes #106555
There was bug where the multithreading check would fire because of the
`compiled_autograd.disable()` calls in AotAutograd, even though compiled
autograd was already disabled, so that call was doing nothing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106621
Approved by: https://github.com/yanboliang
The official move of `OnnxRegistry` to `torch.onnx` allows it to become one of the parameters in `torch.onnx.ExportOptions`. By incorporating `OnnxRegistry` in `torch.onnx.ExportOptions`, users gain access to various functionalities, including the ability to register custom operators using `register_custom_op`, check whether an operator is supported using `is_registered_op`, and obtain symbolic functions that support specific operators using `get_functions`.
Additionally, `opset_version` is now exclusively available in `torch.onnx.OnnxRegistry`, as it is removed from `torch.onnx.ExportOptions`. The initialization of the registry with torchlib under the provided opset version ensures that the exporter uses the specified opset version as the primary version for exporting.
These changes encompass scenarios where users can:
1. Register an unsupported ATen operator with a custom implementation using onnx-script.
2. Override an existing symbolic function (onnx invariant).
NOTE: The custom registered function will be prioritized by the ONNX dispatcher, and if there are multiple custom ones, the one registered last will be picked.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106140
Approved by: https://github.com/justinchuby, https://github.com/thiagocrepaldi
D47969512 was the original diff to revert this, but the diff train doesn't work well, so I have to split it into two parts: this OSS PR and another separate diff to revert the fbcode change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106562
Approved by: https://github.com/angelayi
https://github.com/pytorch/pytorch/issues/105555
The existing flow first exports and then calls torch._inductor.aot_compile. However, export calls aot_autograd with the core ATen decomposition table, and then torch._inductor.aot_compile calls aot_autograd again with the inductor decomposition table. The second call to aot_autograd is supposedly causing some problems and seems excessive, so instead we create a new function, torch._export.aot_compile, which exports using the inductor decomposition table, passes the result to inductor's compile_fx_aot, and, because the model has already been exported, avoids calling aot_autograd again.
```
def aot_compile(
    f: Callable,
    args: Tuple[Any],
    kwargs: Optional[Dict[str, Any]] = None,
    constraints: Optional[List[Constraint]] = None,
) -> Tuple[str, ExportedProgram]:
```
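A hedged usage sketch of the signature above (the callable and inputs are illustrative):
```python
import torch
from torch._export import aot_compile

def f(x):
    return x.sin() + x.cos()

so_path, exported_program = aot_compile(f, (torch.randn(8, device="cuda"),))
# so_path points at the AOT-compiled artifact; exported_program is the exported graph.
```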
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105977
Approved by: https://github.com/desertfire, https://github.com/zhxchen17, https://github.com/eellison
Summary:
make is_causal hint flags available for the top level transformer module.
It's debatable whether this is useful -- at present we autodetect causal masks for src and tgt masks in transformer encoder and decoder, respectively. is_causal flags available woul enable users to short-cut this check by asserting whether they mask is causal, or not.
I am putting this diff up for discussion, not as a solution. Not doing anything may be the right solution, unless there is strong (data-driven) user demand. -- it appears the consensus is to move ahead with this, as per discussions below.
@cpuhrsch @mikaylagawarecki @jbschlosser @janEbert
Test Plan: sandcastle
Differential Revision: D47373260
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106143
Approved by: https://github.com/mikaylagawarecki
Summary: When we do a deep copy of the ExportedProgram, the graph metadata (graph.meta) fails to be copied over because of the custom deep copy override. This can be fixed, but overall I don't see a need for a custom deepcopy in ExportedProgram and am thus trying to get rid of it.
Test Plan: CI
Differential Revision: D48043723
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106578
Approved by: https://github.com/JacobSzwejbka
For the ```llama``` model, there is a pattern where multiple linear layers use the same input and the input dim is > 2:
```input->view->(linear->view->silu, linear->view)```. This PR updates the pattern matcher so that linear+silu can be fused (we first need to remove the view ops, and then apply the fusion patterns).
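A hedged sketch of a module that produces this pattern (names and sizes are illustrative, not taken from the llama source):
```python
import torch
import torch.nn.functional as F

class GatedMLP(torch.nn.Module):
    def __init__(self, dim=64, hidden=128):
        super().__init__()
        self.gate = torch.nn.Linear(dim, hidden)
        self.up = torch.nn.Linear(dim, hidden)

    def forward(self, x):
        # x has shape [batch, seq, dim]: the input dim is > 2, so each Linear
        # lowers to view -> mm -> view, and silu follows one of the branches.
        return F.silu(self.gate(x)) * self.up(x)

x = torch.randn(2, 16, 64)
out = torch.compile(GatedMLP())(x)
```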
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106300
Approved by: https://github.com/jgong5, https://github.com/jansel
While working on loop ordering, I happened to find that inductor may cache a stale inner_fn_str and ReadWrites object in a ComputedBuffer.
Let's say we have producer buffer buf0 and consumer buffer buf1. Before we call GraphLowering.finalize, the layout for buf0 may be a FlexibleLayout. At that moment, the inner_fn_str or ReadWrites object computed for buf1 will be based on the layout of buf0 which most likely is a contiguous FlexibleLayout. And they will be cached on buf1 object (or buf1.data).
However after we call GraphLowering.finalize, we may realize it's better to give a non-contiguous layout for buf0 (e.g., if its input has non-contiguous layout or whatever reason). The layout change of buf0 should affect the inner_fn_str and ReadWrites object for buf1. But we may have cached those on buf1. The stale ReadWrites objects for buf1 may result in sub-optimal strides for buf1.
This may affect perf and I'll check the nightly runs.
Here is a dump of `nodes` in `Scheduler.__init__` before the fix as a reference: https://gist.github.com/shunting314/ed2152a08e268f5563fd55398b1392c7
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106502
Approved by: https://github.com/jansel
Summary: Rename static tracepoint macros to better describe their targeted usage.
Test Plan:
Same as for D47159249:
Tested the following macros on test scripts with libbpf USDTs:
* `CAFFE_SDT`
* `CAFFE_DISABLE_SDT`
* `CAFFE_SDT_WITH_SEMAPHORE`
Reviewed By: chaekit
Differential Revision: D47727339
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106380
Approved by: https://github.com/chaekit
A bunch of the tests are getting skipped/xfailed because of generated_kernel_count checks. In other tests, inductor metrics automatically get reset in the common() function, so we should do this in the test_torchinductor_codegen_dynamic_shapes tests as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106481
Approved by: https://github.com/eellison
Summary:
This change fixes split_module's interaction with dead code. Previously if a dead region was split out, split module would throw an error while attempting to access the outputs for the partition even though the partition has no outputs.
This change adds a new unit test to cover the dead code case and changes the output check to allow no output. A split module with no output will now return None, like a normal Python function.
Unit Test Added:
test_split_module_dead_code
A module with dead code:
```
class ModWithDeadCode(torch.nn.Module):
def forward(self, x):
output = x * 2 # we want this
dead_line = x + 2 # this is dead
return output
```
Before:
```
torch/fx/passes/split_module.py, line 357, in split_module
base_mod_env[list(partition.outputs)[0]] = output_val
IndexError: list index out of range
```
After:
```
class GraphModule(torch.nn.Module):
def forward(self, x):
# No stacktrace found for following nodes
submod_2 = self.submod_2(x)
submod_1 = self.submod_1(x); x = None
return submod_1
class GraphModule(torch.nn.Module):
def forward(self, x):
# No stacktrace found for following nodes
add = x + 2; x = None
return None
class GraphModule(torch.nn.Module):
def forward(self, x):
# No stacktrace found for following nodes
mul = x * 2; x = None
return mul
```
Submod 2 is correctly extracted
Test Plan: Tested with new unit test
Differential Revision: D47196732
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104554
Approved by: https://github.com/yf225
Summary: check_trace runs with no_grad(), and whether grad is enabled or not impacts transformer trace construction. Use no_grad() consistently.
Test Plan:
sandcastle and github ci
```
buck2 run mode/opt mode/inplace //caffe2/test:test_jit_cuda -- --regex test_scriptmodule_transformer_cuda
```
Differential Revision: D48020889
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106523
Approved by: https://github.com/davidberard98
When inlining a function that loads a closure, its direct parent may not load that closure, so we cannot find the closure name in the parent's symbolic locals. In this PR, we fix this by recursively searching the parent instruction translator stack to resolve the closure.
**Background**
When developing https://github.com/pytorch/pytorch/pull/105679, this corner case was triggered. A small repro is added in the test of this PR, where `outer` is loaded by `deep2` but not by `deep`.
```python
def test_inline_closure_not_loaded_by_parent(self):
def outer(a):
return a + 1
def indirect(x):
return direct(x)
def direct(x):
def deep2(c):
return outer(c)
def deep(c):
return deep2(c)
return deep(x)
x = torch.randn(3)
eager = indirect(x)
counter = CompileCounter()
compiled = torch._dynamo.optimize(counter)(indirect)(x)
```
Running the test, we have the following error before the PR:
```
Traceback (most recent call last):
File "/home/yidi/local/pytorch/test/dynamo/test_misc.py", line 6584, in test_inline_closure_not_loaded_by_parent
compiled = torch._dynamo.optimize(counter)(indirect)(x)
File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 321, in _fn
return fn(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 481, in catch_errors
return callback(frame, cache_size, hooks, frame_state)
File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 543, in _convert_frame
result = inner_convert(frame, cache_size, hooks, frame_state)
File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 130, in _fn
return fn(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 362, in _convert_frame_assert
return _compile(
File "/home/yidi/local/pytorch/torch/_dynamo/utils.py", line 194, in time_wrapper
r = func(*args, **kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 531, in _compile
raise InternalTorchDynamoError(str(e)).with_traceback(e.__traceback__) from None
File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 432, in _compile
out_code = transform_code_object(code, transform)
File "/home/yidi/local/pytorch/torch/_dynamo/bytecode_transformation.py", line 1028, in transform_code_object
transformations(instructions, code_options)
File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 417, in transform
tracer.run()
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2067, in run
super().run()
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 724, in run
and self.step()
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 688, in step
getattr(self, inst.opname)(inst)
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 392, in wrapper
return inner_fn(self, inst)
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1116, in CALL_FUNCTION
self.call_function(fn, args, {})
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 562, in call_function
self.push(fn.call_function(self, args, kwargs))
File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 261, in call_function
return super().call_function(tx, args, kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 90, in call_function
return tx.inline_user_function_return(
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 598, in inline_user_function_return
result = InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2172, in inline_call
return cls.inline_call_(parent, func, args, kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2279, in inline_call_
tracer.run()
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 724, in run
and self.step()
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 688, in step
getattr(self, inst.opname)(inst)
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 392, in wrapper
return inner_fn(self, inst)
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1116, in CALL_FUNCTION
self.call_function(fn, args, {})
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 562, in call_function
self.push(fn.call_function(self, args, kwargs))
File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 90, in call_function
return tx.inline_user_function_return(
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 598, in inline_user_function_return
result = InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2172, in inline_call
return cls.inline_call_(parent, func, args, kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2279, in inline_call_
tracer.run()
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 724, in run
and self.step()
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 688, in step
getattr(self, inst.opname)(inst)
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 392, in wrapper
return inner_fn(self, inst)
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1116, in CALL_FUNCTION
self.call_function(fn, args, {})
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 562, in call_function
self.push(fn.call_function(self, args, kwargs))
File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 90, in call_function
return tx.inline_user_function_return(
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 598, in inline_user_function_return
result = InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2172, in inline_call
return cls.inline_call_(parent, func, args, kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2227, in inline_call_
sub_locals, closure_cells = func.bind_args(parent, args, kwargs)
File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 471, in bind_args
result[name] = parent.symbolic_locals[name]
torch._dynamo.exc.InternalTorchDynamoError: outer
from user code:
File "/home/yidi/local/pytorch/test/dynamo/test_misc.py", line 6570, in indirect
return direct(x)
File "/home/yidi/local/pytorch/test/dynamo/test_misc.py", line 6579, in direct
return deep(x)
File "/home/yidi/local/pytorch/test/dynamo/test_misc.py", line 6577, in deep
return deep2(c)
Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
You can suppress this exception and fall back to eager by setting:
import torch._dynamo
torch._dynamo.config.suppress_errors = True
To execute this test, run the following from the base repo dir:
python test/dynamo/test_misc.py -k test_inline_closure_not_loaded_by_parent
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
---------------------------------------------------------------------------------------------------------------------------- Captured stdout call -----------------------------------------------------------------------------------------------------------------------------
frames [('total', 1)]
inline_call []
---------------------------------------------------------------------------------------------------------------------------- Captured stderr call -----------------------------------------------------------------------------------------------------------------------------
[2023-08-02 15:48:36,560] torch._dynamo.eval_frame: [DEBUG] skipping __init__ /home/yidi/local/miniconda3/envs/pytorch-3.10/lib/python3.10/contextlib.py
[2023-08-02 15:48:36,560] torch._dynamo.eval_frame: [DEBUG] skipping __enter__ /home/yidi/local/miniconda3/envs/pytorch-3.10/lib/python3.10/contextlib.py
[2023-08-02 15:48:36,560] torch._dynamo.eval_frame: [DEBUG] skipping helper /home/yidi/local/miniconda3/envs/pytorch-3.10/lib/python3.10/contextlib.py
[2023-08-02 15:48:36,560] torch._dynamo.eval_frame: [DEBUG] skipping __init__ /home/yidi/local/miniconda3/envs/pytorch-3.10/lib/python3.10/contextlib.py
[2023-08-02 15:48:36,560] torch._dynamo.eval_frame: [DEBUG] skipping __enter__ /home/yidi/local/miniconda3/envs/pytorch-3.10/lib/python3.10/contextlib.py
[2023-08-02 15:48:36,560] torch._dynamo.eval_frame: [DEBUG] skipping enable_dynamic /home/yidi/local/pytorch/torch/_dynamo/eval_frame.py
[2023-08-02 15:48:36,561] torch._dynamo.symbolic_convert: [INFO] Step 1: torchdynamo start tracing indirect /home/yidi/local/pytorch/test/dynamo/test_misc.py:6569
TRACE starts_line indirect /home/yidi/local/pytorch/test/dynamo/test_misc.py:6569
def indirect(x):
[2023-08-02 15:48:36,591] torch._dynamo.variables.builder: [DEBUG] wrap_to_fake L['x'] (3,) [<DimDynamic.STATIC: 2>] [None]
TRACE starts_line indirect /home/yidi/local/pytorch/test/dynamo/test_misc.py:6570
return direct(x)
[2023-08-02 15:48:36,594] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_DEREF direct []
[2023-08-02 15:48:36,594] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_FAST x [UserFunctionVariable()]
[2023-08-02 15:48:36,594] torch._dynamo.symbolic_convert: [DEBUG] TRACE CALL_FUNCTION 1 [UserFunctionVariable(), TensorVariable()]
[2023-08-02 15:48:36,595] torch._dynamo.symbolic_convert: [DEBUG] INLINING <code object direct at 0x7fbe4d366810, file "/home/yidi/local/pytorch/test/dynamo/test_misc.py", line 6572>
TRACE starts_line direct /home/yidi/local/pytorch/test/dynamo/test_misc.py:6572 (inline depth: 1)
def direct(x):
TRACE starts_line direct /home/yidi/local/pytorch/test/dynamo/test_misc.py:6573 (inline depth: 1)
def deep2(c):
[2023-08-02 15:48:36,595] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_CLOSURE outer []
[2023-08-02 15:48:36,595] torch._dynamo.symbolic_convert: [DEBUG] TRACE BUILD_TUPLE 1 [InlinedClosureVariable()]
[2023-08-02 15:48:36,595] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_CONST <code object deep2 at 0x7fbe4d3666b0, file "/home/yidi/local/pytorch/test/dynamo/test_misc.py", line 6573> [TupleVariable()]
[2023-08-02 15:48:36,595] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_CONST MiscTests.test_inline_closure_not_loaded_by_parent.<locals>.direct.<locals>.deep2 [TupleVariable(), ConstantVariable(code)]
[2023-08-02 15:48:36,595] torch._dynamo.symbolic_convert: [DEBUG] TRACE MAKE_FUNCTION 8 [TupleVariable(), ConstantVariable(code), ConstantVariable(str)]
[2023-08-02 15:48:36,597] torch._dynamo.symbolic_convert: [DEBUG] TRACE STORE_DEREF deep2 [NestedUserFunctionVariable()]
TRACE starts_line direct /home/yidi/local/pytorch/test/dynamo/test_misc.py:6576 (inline depth: 1)
def deep(c):
[2023-08-02 15:48:36,597] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_CLOSURE deep2 []
[2023-08-02 15:48:36,597] torch._dynamo.symbolic_convert: [DEBUG] TRACE BUILD_TUPLE 1 [NewCellVariable()]
[2023-08-02 15:48:36,597] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_CONST <code object deep at 0x7fbe4d366760, file "/home/yidi/local/pytorch/test/dynamo/test_misc.py", line 6576> [TupleVariable()]
[2023-08-02 15:48:36,597] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_CONST MiscTests.test_inline_closure_not_loaded_by_parent.<locals>.direct.<locals>.deep [TupleVariable(), ConstantVariable(code)]
[2023-08-02 15:48:36,597] torch._dynamo.symbolic_convert: [DEBUG] TRACE MAKE_FUNCTION 8 [TupleVariable(), ConstantVariable(code), ConstantVariable(str)]
[2023-08-02 15:48:36,598] torch._dynamo.symbolic_convert: [DEBUG] TRACE STORE_FAST deep [NestedUserFunctionVariable()]
TRACE starts_line direct /home/yidi/local/pytorch/test/dynamo/test_misc.py:6579 (inline depth: 1)
return deep(x)
[2023-08-02 15:48:36,598] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_FAST deep []
[2023-08-02 15:48:36,598] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_FAST x [NestedUserFunctionVariable()]
[2023-08-02 15:48:36,598] torch._dynamo.symbolic_convert: [DEBUG] TRACE CALL_FUNCTION 1 [NestedUserFunctionVariable(), TensorVariable()]
[2023-08-02 15:48:36,598] torch._dynamo.symbolic_convert: [DEBUG] INLINING <code object deep at 0x7fbe4d366760, file "/home/yidi/local/pytorch/test/dynamo/test_misc.py", line 6576>
TRACE starts_line deep /home/yidi/local/pytorch/test/dynamo/test_misc.py:6576 (inline depth: 2)
def deep(c):
TRACE starts_line deep /home/yidi/local/pytorch/test/dynamo/test_misc.py:6577 (inline depth: 2)
return deep2(c)
[2023-08-02 15:48:36,599] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_DEREF deep2 []
[2023-08-02 15:48:36,599] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_FAST c [NestedUserFunctionVariable()]
[2023-08-02 15:48:36,599] torch._dynamo.symbolic_convert: [DEBUG] TRACE CALL_FUNCTION 1 [NestedUserFunctionVariable(), TensorVariable()]
[2023-08-02 15:48:36,599] torch._dynamo.output_graph: [DEBUG] restore_graphstate: removed 0 nodes
[2023-08-02 15:48:36,599] torch._dynamo.symbolic_convert: [DEBUG] FAILED INLINING <code object deep at 0x7fbe4d366760, file "/home/yidi/local/pytorch/test/dynamo/test_misc.py", line 6576>
[2023-08-02 15:48:36,599] torch._dynamo.output_graph: [DEBUG] restore_graphstate: removed 0 nodes
[2023-08-02 15:48:36,599] torch._dynamo.symbolic_convert: [DEBUG] FAILED INLINING <code object direct at 0x7fbe4d366810, file "/home/yidi/local/pytorch/test/dynamo/test_misc.py", line 6572>
[2023-08-02 15:48:36,599] torch._dynamo.output_graph: [DEBUG] restore_graphstate: removed 0 nodes
```
Test Plan:
add new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106491
Approved by: https://github.com/williamwen42, https://github.com/jansel, https://github.com/zou3519
Includes stable diffusion, whisper, llama7b and clip
To get this to work I had to pass in the HF auth token to all CI jobs; GitHub does not pass secrets from parent to child jobs automatically. There's a likelihood HF will rate limit us; in that case please revert this PR and I'll work on adding a cache next - cc @voznesenskym @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @aakhundov @malfet
Something upstream also changed in torchbench: `hf_Bert` and `hf_Bert_large` are now both failing with what looks like a dynamic-shape error that I'm not sure how to debug yet, so for now it felt a bit gross but I added a skip since others are building on top of this work @ezyang
`llamav2_7b_16h` cannot pass accuracy checks because it OOMs on deep-cloning the extra inputs; this seems to keep it out of the expected-numbers CSV. We will figure this out when we update the pin with https://github.com/pytorch/benchmark/pull/1803 cc @H-Huang @xuzhao9 @cpuhrsch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106009
Approved by: https://github.com/malfet
Summary:
This diff has a couple of hacks to make inductor-CPU work for AOT codegen in fbcode:
- We need to add the CUDA link flags; AOT-Inductor is specialized for CUDA
right now and uses a lot of `at::cuda` stuff. We should do a proper AOT CPU
at some point but this unblocks perf measurement.
- Add an include path to the cpp_prefix. It's kind of hilarious; we remove the
include path for remote execution, but then for AOT we need it back. 🤷
Test Plan: internal test
Differential Revision: D47882848
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106225
Approved by: https://github.com/mikekgfb, https://github.com/bdhirsh, https://github.com/jansel
This PR adds a new `CustomPolicy` that acts like the existing `lambda_auto_wrap_policy` except it (1) leverages the new auto wrapping infrastructure and (2) allows overriding FSDP kwargs for particular instances. (1) gives it access to the validation checks (like for frozen parameters), and (2) makes it as expressive as manual wrapping. This should allow us to effectively deprecate manual wrapping if desired.
The API is as follows:
```
def lambda_fn(module: nn.Module) -> Union[bool, Dict[str, Any]]:
...
policy = CustomPolicy(lambda_fn)
```
The `lambda_fn` can return (see the sketch after this list):
- `False` or `{}` to indicate no wrapping
- `True` to indicate wrapping while inheriting the root's FSDP kwargs
- Non-empty `dict` to indicate wrapping while overriding the specified FSDP kwargs and inheriting the rest from the root
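For illustration, a minimal sketch of such a `lambda_fn` (the module types, the kwarg choice, and the import paths below are assumptions, not taken from this PR's tests):
```python
import torch.nn as nn
from torch.distributed.fsdp import ShardingStrategy
from torch.distributed.fsdp.wrap import CustomPolicy

def lambda_fn(module: nn.Module):
    if isinstance(module, nn.Embedding):
        return {}   # falsy empty dict: do not wrap
    if isinstance(module, nn.TransformerEncoderLayer):
        # wrap, overriding one FSDP kwarg and inheriting the rest from the root
        return {"sharding_strategy": ShardingStrategy.SHARD_GRAD_OP}
    return isinstance(module, nn.Linear)  # True: wrap with the root's FSDP kwargs

policy = CustomPolicy(lambda_fn)
```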
---
After this PR, the follow-up work items for auto wrapping are:
1. Add shared parameter validation
2. (Longer-term / exploratory) Add a policy that provides a reasonable auto wrapping with "minimal" user input
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104986
Approved by: https://github.com/ezyang
ghstack dependencies: #104427, #104967, #104999, #104969
This does some code organization improvement.
- It renames `_FSDPPolicy` to `_Policy` to show that it is not only for FSDP but for any module-level API.
- It formalizes the contract that such a policy should return something like `target_module_to_kwargs: Dict[nn.Module, Dict[str, Any]]` that maps each module to wrap to its kwargs. It does so by requiring a `_run_policy` abstract method (this time private since users do not need to care about it). Then, our auto wrapping can just call `_run_policy()` to generate the dict and do any validation or post-processing.
This PR is technically BC-breaking because it removes the public `ModuleWrapPolicy.policy`. However, I do not think anyone was using that anyway, so this is a pretty safe breakage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104969
Approved by: https://github.com/rohan-varma
ghstack dependencies: #104427, #104967, #104999
Previously, you would get an error like
```
Dynamo input and output is a strict subset of traced input/output
```
now you get
```
Cannot export model which references tensors that are neither
buffers/parameters/constants nor are direct inputs. For each tensor, if you'd
like this tensor to be an explicit input, add it as a dummy argument
to the top-level model definition you are exporting; if you would
like its value to be embedded as an exported constant, wrap its access
in a function marked with @assume_constant_result.
G['bulbous_bouffant'], accessed at:
File "test_export.py", line N, in f
return bulbous_bouffant + y
```
This doesn't handle outputs, I'm going to hit that next.
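For reference, a hedged sketch of the second remediation suggested by the message (the global name mirrors the error example; how exactly the decorator is applied here is an assumption):
```python
import torch

bulbous_bouffant = torch.randn(4)   # a global tensor referenced by the exported function

@torch._dynamo.assume_constant_result
def get_bulbous_bouffant():
    return bulbous_bouffant

def f(y):
    # Accessing the global through the marked function embeds its value
    # as an exported constant instead of erroring out.
    return get_bulbous_bouffant() + y
```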
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106403
Approved by: https://github.com/tugsbayasgalan
Compiler behavior when a non-zero offset is added to a null pointer is undefined, and relying on it is a bad habit.
- When `lapackEig` is called to estimate a workspace size, do not add the matrix size to the W pointer.
- When `unpack_pivots_cpu_kernel` is called with zero `dim_size`, exit early.
- When `topk_impl_loop` is called with `k` equal to zero, exit right away, as the output tensors are empty anyway.
- Skip adding a non-zero storage offset in `TensorImpl::data_ptr_impl_impl`, which can be the case if a tensor is created as `torch.empty(3)[4:]`.
- In `s_addmm_out_sparse_dense_worker` do not call `axpy` over an empty vector.
- In `_sparse_binary_op_intersection_kernel_impl` skip computing `ptr_indices_dim` when `sparse_dim` is empty.
- Exit `grid_sample` forward/backward kernels earlier if either `input` or `grid` are empty tensors.
Found by asan in clang-12
Before the change UBSan report looks as follows:
```
ASAN_SYMBOLIZER_PATH=/usr/lib/llvm-12/bin/llvm-symbolizer UBSAN_OPTIONS=print_stacktrace=1 LD_PRELOAD=/usr/lib/llvm-12/lib/clang/12.0.1/lib/linux/libclang_rt.asan-x86_64.so python test_fx_experimental.py -v -k test_normalize_operator_exhaustive_linalg_eig_cpu_float32
Test results will be stored in test-reports/python-unittest/test_fx_experimental
Running tests...
----------------------------------------------------------------------
test_normalize_operator_exhaustive_linalg_eig_cpu_float32 (__main__.TestNormalizeOperatorsCPU) ... /opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/overrides.py:111: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()'
torch.has_cuda,
/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/overrides.py:112: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()'
torch.has_cudnn,
/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/overrides.py:118: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()'
torch.has_mps,
/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/overrides.py:119: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()'
torch.has_mkldnn,
/var/lib/jenkins/workspace/aten/src/ATen/native/BatchLinearAlgebra.cpp:937:17: runtime error: applying non-zero offset 20 to null pointer
#0 0x7f2025794888 in void at::native::lapackEig<float, float>(char, char, int, float*, int, float*, float*, int, float*, int, float*, int, float*, int*) (/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so+0x9945888)
#1 0x7f20257da256 in void at::native::(anonymous namespace)::apply_linalg_eig<float>(at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, bool) (/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so+0x998b256)
#2 0x7f20257d902d in at::native::(anonymous namespace)::linalg_eig_kernel(at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor const&, bool) (/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so+0x998a02d)
#3 0x7f20257b5b3d in at::native::linalg_eig_out_info(at::Tensor const&, at::Tensor&, at::Tensor&, at::Tensor&, bool) (/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so+0x9966b3d)
#4 0x7f20257b4770 in at::native::linalg_eig_out(at::Tensor const&, at::Tensor&, at::Tensor&) (/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so+0x9965770)
#5 0x7f20280710e6 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<std::tuple<at::Tensor&, at::Tensor&> (at::Tensor const&, at::Tensor&, at::Tensor&), &(at::(anonymous namespace)::(anonymous namespace)::wrapper_CPU_out_linalg_eig_out(at::Tensor const&, at::Tensor&, at::Tensor&))>, std::tuple<at::Tensor&, at::Tensor&>, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor&, at::Tensor&> >, std::tuple<at::Tensor&, at::Tensor&> (at::Tensor const&, at::Tensor&, at::Tensor&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor&, at::Tensor&) (/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so+0xc2220e6)
#6 0x7f202727a045 in at::_ops::linalg_eig_out::call(at::Tensor const&, at::Tensor&, at::Tensor&) (/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so+0xb42b045)
#7 0x7f20257b7e29 in at::native::linalg_eig(at::Tensor const&) (/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so+0x9968e29)
#8 0x7f2028070bf0 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<std::tuple<at::Tensor, at::Tensor> (at::Tensor const&), &(at::(anonymous namespace)::(anonymous namespace)::wrapper_CPU__linalg_eig(at::Tensor const&))>, std::tuple<at::Tensor, at::Tensor>, c10::guts::typelist::typelist<at::Tensor const&> >, std::tuple<at::Tensor, at::Tensor> (at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) (/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so+0xc221bf0)
#9 0x7f2026b1f787 in std::tuple<at::Tensor, at::Tensor> c10::Dispatcher::redispatch<std::tuple<at::Tensor, at::Tensor>, at::Tensor const&>(c10::TypedOperatorHandle<std::tuple<at::Tensor, at::Tensor> (at::Tensor const&)> const&, c10::DispatchKeySet, at::Tensor const&) const (/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so+0xacd0787)
#10 0x7f20273230a7 in at::_ops::linalg_eig::redispatch(c10::DispatchKeySet, at::Tensor const&) (/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so+0xb4d40a7)
#11 0x7f202c3cc32d in torch::autograd::VariableType::(anonymous namespace)::linalg_eig(c10::DispatchKeySet, at::Tensor const&) (/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so+0x1057d32d)
#12 0x7f202c3cba96 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<std::tuple<at::Tensor, at::Tensor> (c10::DispatchKeySet, at::Tensor const&), &(torch::autograd::VariableType::(anonymous namespace)::linalg_eig(c10::DispatchKeySet, at::Tensor const&))>, std::tuple<at::Tensor, at::Tensor>, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&> >, std::tuple<at::Tensor, at::Tensor> (c10::DispatchKeySet, at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) (/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so+0x1057ca96)
#13 0x7f20272798e0 in at::_ops::linalg_eig::call(at::Tensor const&) (/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so+0xb42a8e0)
#14 0x7f2043d97ae3 in torch::autograd::THPVariable_linalg_eig(_object*, _object*, _object*) (/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/lib/libtorch_python.so+0x23feae3)
#15 0x5072d6 in cfunction_call /usr/local/src/conda/python-3.9.17/Objects/methodobject.c:543:19
...
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /var/lib/jenkins/workspace/aten/src/ATen/native/BatchLinearAlgebra.cpp:937:17 in
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106354
Approved by: https://github.com/huydhn, https://github.com/lezcano
There is a design flaw in NCCLWatchdog, namely it spawns threads that
talk to the CUDA api, but the CUDA api may have been deinitialized,
forming a race.
This is a known issue with widespread impact
(https://github.com/pytorch/pytorch/issues/90848).
I should point out that I tested this fix on the repro command for https://github.com/pytorch/pytorch/issues/82632 by running `NCCL_DESYNC_DEBUG=1 CUDA_LAUNCH_BLOCKING=1 python test/distributed/test_c10d_nccl.py -k test_find_unused_parameters_kwarg_debug_detail` and observing that, instead of crashing, we see log messages with the exception string about the CUDA driver shutdown error.
A partial fix was landed already, but it applied too narrowly:
ec071a0815
This PR is a copy-paste of the previous fix, applying to one more case,
plugging a hole. We probably need to do a more thorough review and
either plug all the holes, or design this differently.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106503
Approved by: https://github.com/kwen2501
### Description
As an alternative to PR #105774, which provides a standalone, end-to-end minification script that covers all types of failures and has more functionality, this PR adds the ability to minify models when they fail the eval loop (accuracy checks). Both this PR and the other one can be merged without issue.
### Purpose
The goal is to leverage the minifier to minify models that fail accuracy checks, allowing failed models to be debugged more easily. The ideal use case is trying to run a model suite on a backend where operator coverage is not known or is limited. If a model compiles but fails the eval loop, having a repro script for each model is valuable for any developer who is trying to fix the issue.
### Functionality
- Create minify flag that minifies models when they fail accuracy check
- Produce minified graph for each model, and save it into repro script
- Move repro script to output directory/base Dynamo directory
- Enable functionality for running an entire model suite (Hugging Face, timm, and TorchBench) by prepending model name to repro script
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106201
Approved by: https://github.com/ezyang
Summary:
- PyTorch testing chokes sometimes when it sees an exception where the first
argument is not a string. fake_tensor.UnsupportedOperatorException's first
arg is an OpOverload. This PR fixes PyTorch testing to not choke. I'm not
really sure how to reproduce this in OSS.
- It turns out that if an operator does not have a meta kernel, the FakeTensor
rule is really slow (30ms in OSS in debug mode, 3s on some internal config).
The thing that is slow (aside from the previous diff) is waiting for the Dispatcher to
report NotImplemented and then attempting to catch that. I'm not really sure
why this is slow but it's easy to workaround so I added a workaround.
Test Plan: - existing tests
Differential Revision: D47917554
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106311
Approved by: https://github.com/eellison
This PR aims to sort out the data type for `constant`.
The constant should be promoted to float https://github.com/pytorch/pytorch/pull/105440. So there are several changes to do:
- Data type propagation should propagate the constant node to `float` dtype if the original dtype is `bfloat16`
- We do not need to insert `to_dtype` after the `constant` node; directly initializing an `fp32` constant is faster:
```
vectorized<bfloat16> tmp(value);
vectorized<float> tmp1 = cvt_bf16_fp32(tmp);
->
vectorized<float> tmp(value);
```
- move `constant` out of the list for `all operations can support bf16 without converting to fp32`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105827
Approved by: https://github.com/jgong5, https://github.com/jansel
It is confusing to not print stream 0 but print other streams; it makes stream 0 allocations seem like they are missing a stream annotation. This change will print streams for everything unless all the events are on stream 0, in which case it will not print streams at all.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106483
Approved by: https://github.com/albanD
ghstack dependencies: #106328, #106482
Fixes https://github.com/pytorch/pytorch/issues/103210
Test Plan:
Before the fix:
```
pytest test/dynamo/test_export.py -k suppress_errors
```
got result:
```
File "/data/users/zhxchen17/pytorch/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/zhxchen17/pytorch/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/zhxchen17/pytorch/torch/_dynamo/eval_frame.py", line 295, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/data/users/zhxchen17/pytorch/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/zhxchen17/pytorch/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/zhxchen17/pytorch/torch/_dynamo/eval_frame.py", line 448, in catch_errors
return callback(frame, cache_size, hooks, frame_state)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/zhxchen17/pytorch/torch/_dynamo/convert_frame.py", line 127, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/data/users/zhxchen17/pytorch/torch/_dynamo/convert_frame.py", line 360, in _convert_frame_assert
return _compile(
^^^^^^^^^
File "/data/users/zhxchen17/pytorch/torch/_dynamo/utils.py", line 180, in time_wrapper
r = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/data/users/zhxchen17/pytorch/torch/_dynamo/convert_frame.py", line 511, in _compile
exception_handler(e, code, frame)
File "/data/users/zhxchen17/pytorch/torch/_dynamo/convert_frame.py", line 216, in exception_handler
log.error(format_error_msg(e, code, record_filename, frame))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/zhxchen17/pytorch/torch/_dynamo/exc.py", line 248, in format_error_msg
stack_above_dynamo = filter_stack(extract_stack(frame))
^^^^^^^^^^^^^^^^^^^^
File "/home/zhxchen17/miniconda3/envs/dev/lib/python3.11/traceback.py", line 231, in extract_stack
stack = StackSummary.extract(walk_stack(f), limit=limit)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/zhxchen17/miniconda3/envs/dev/lib/python3.11/traceback.py", line 393, in extract
return klass._extract_from_extended_frame_gen(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/zhxchen17/miniconda3/envs/dev/lib/python3.11/traceback.py", line 416, in _extract_from_extended_frame_gen
for f, (lineno, end_lineno, colno, end_colno) in frame_gen:
File "/home/zhxchen17/miniconda3/envs/dev/lib/python3.11/traceback.py", line 390, in extended_frame_gen
for f, lineno in frame_gen:
File "/home/zhxchen17/miniconda3/envs/dev/lib/python3.11/traceback.py", line 334, in walk_stack
yield f, f.f_lineno
^^^^^^^^^^
AttributeError: 'torch._C.dynamo.eval_frame._PyInterpreterFrame' object has no attribute 'f_lineno'
```
After the fix:
```
pytest test/dynamo/test_export.py -k suppress_errors -s
```
Got Result:
```
File "/data/users/zhxchen17/pytorch/torch/_dynamo/exc.py", line 135, in unimplemented
raise Unsupported(msg)
torch._dynamo.exc.Unsupported: map() operator doesn't support scalar or zero-sized tensors during
tracing.
========== The above exception occurred while processing the following code ==========
File "/data/users/zhxchen17/pytorch/test/dynamo/test_export.py", line 3043, in forward
def forward(self, xs):
File "/data/users/zhxchen17/pytorch/test/dynamo/test_export.py", line 3047, in forward
return map(body, xs)
==========
unimplemented [("map() operator doesn't support scalar or zero-sized tensors during tracing.", 1)]
.
=============================== 1 passed, 133 deselected in 4.60s ================================
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103227
Approved by: https://github.com/williamwen42
This PR adds improved error/warning messaging when auto wrapping with `ModuleWrapPolicy` in the presence of frozen parameters.
- For `use_orig_params=False`, FSDP requires uniform `requires_grad` for each FSDP instance. This PR adds a `ValueError` at wrapping time with a message that mentions the violating module and the frozen/non-frozen parameter names.
- For `use_orig_params=True`, FSDP allows non-uniform `requires_grad` for each FSDP instance. However, it will result in higher-than-expected gradient memory usage. This PR adds a `UserWarning` at wrapping time with a message that mentions the violating module, how much extra gradient memory will be used (in units of numel), and the frozen/non-frozen parameter names.
- There is a possibility that this warning will be spammy/verbose, but my current thinking is that it is okay for now unless users complain.
<details>
<summary> Why DFS via named_children() vs. Using named_modules()</summary>
```
LoraModel(
(embed_tokens): Embedding(100, 32)
(layers): ModuleList(
(0-3): 4 x LoraDecoder(
(attn): LoraAttention(
(q_proj): Linear(in_features=32, out_features=32, bias=False)
(lora_A): Linear(in_features=32, out_features=8, bias=False)
(lora_B): Linear(in_features=8, out_features=32, bias=False)
(k_proj): Linear(in_features=32, out_features=32, bias=False)
(v_proj): Linear(in_features=32, out_features=32, bias=False)
(o_proj): Linear(in_features=32, out_features=32, bias=False)
)
(mlp): LoraMLP(
(proj1): Linear(in_features=32, out_features=128, bias=False)
(proj2): Linear(in_features=128, out_features=32, bias=False)
)
(inp_layernorm): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
(post_attn_layernorm): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
)
)
(norm): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
)
```
Reverse topological order with stack-based DFS via `named_children()`:
```
[
'embed_tokens',
'layers.0.attn.q_proj', 'layers.0.attn.lora_A', 'layers.0.attn.lora_B', 'layers.0.attn.k_proj', 'layers.0.attn.v_proj', 'layers.0.attn.o_proj', 'layers.0.attn', 'layers.0.mlp.proj1', 'layers.0.mlp.proj2', 'layers.0.mlp', 'layers.0.inp_layernorm', 'layers.0.post_attn_layernorm', 'layers.0',
'layers.1.attn.q_proj', 'layers.1.attn.lora_A', 'layers.1.attn.lora_B', 'layers.1.attn.k_proj', 'layers.1.attn.v_proj', 'layers.1.attn.o_proj', 'layers.1.attn', 'layers.1.mlp.proj1', 'layers.1.mlp.proj2', 'layers.1.mlp', 'layers.1.inp_layernorm', 'layers.1.post_attn_layernorm', 'layers.1',
'layers.2.attn.q_proj', 'layers.2.attn.lora_A', 'layers.2.attn.lora_B', 'layers.2.attn.k_proj', 'layers.2.attn.v_proj', 'layers.2.attn.o_proj', 'layers.2.attn', 'layers.2.mlp.proj1', 'layers.2.mlp.proj2', 'layers.2.mlp', 'layers.2.inp_layernorm', 'layers.2.post_attn_layernorm', 'layers.2',
'layers.3.attn.q_proj', 'layers.3.attn.lora_A', 'layers.3.attn.lora_B', 'layers.3.attn.k_proj', 'layers.3.attn.v_proj', 'layers.3.attn.o_proj', 'layers.3.attn', 'layers.3.mlp.proj1', 'layers.3.mlp.proj2', 'layers.3.mlp', 'layers.3.inp_layernorm', 'layers.3.post_attn_layernorm', 'layers.3',
'layers', 'norm', ''
]
```
Reverse topological order with `named_modules()`:
```
[
'norm',
'layers.3.post_attn_layernorm', 'layers.3.inp_layernorm', 'layers.3.mlp.proj2', 'layers.3.mlp.proj1', 'layers.3.mlp', 'layers.3.attn.o_proj', 'layers.3.attn.v_proj', 'layers.3.attn.k_proj', 'layers.3.attn.lora_B', 'layers.3.attn.lora_A', 'layers.3.attn.q_proj', 'layers.3.attn', 'layers.3',
'layers.2.post_attn_layernorm', 'layers.2.inp_layernorm', 'layers.2.mlp.proj2', 'layers.2.mlp.proj1', 'layers.2.mlp', 'layers.2.attn.o_proj', 'layers.2.attn.v_proj', 'layers.2.attn.k_proj', 'layers.2.attn.lora_B', 'layers.2.attn.lora_A', 'layers.2.attn.q_proj', 'layers.2.attn', 'layers.2',
'layers.1.post_attn_layernorm', 'layers.1.inp_layernorm', 'layers.1.mlp.proj2', 'layers.1.mlp.proj1', 'layers.1.mlp', 'layers.1.attn.o_proj', 'layers.1.attn.v_proj', 'layers.1.attn.k_proj', 'layers.1.attn.lora_B', 'layers.1.attn.lora_A', 'layers.1.attn.q_proj', 'layers.1.attn', 'layers.1', 'layers.0.post_attn_layernorm', 'layers.0.inp_layernorm', 'layers.0.mlp.proj2', 'layers.0.mlp.proj1', 'layers.0.mlp', 'layers.0.attn.o_proj', 'layers.0.attn.v_proj', 'layers.0.attn.k_proj', 'layers.0.attn.lora_B', 'layers.0.attn.lora_A', 'layers.0.attn.q_proj', 'layers.0.attn', 'layers.0',
'layers', 'embed_tokens', ''
]
```
With the stack-based DFS via `named_children()`, reversing the topological order gives us each level in the module tree in the registered order, whereas with `named_modules()`, reversing the topological order gives us each level in reverse. Both are valid orders, but we prefer the former since it allows us to error/warn on the _first-registered_ module that violates the frozen/non-frozen condition.
</details>
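For reference, a minimal sketch of the stack-based DFS over `named_children()` that produces the first ordering above (illustrative only, not the FSDP implementation):
```python
import torch.nn as nn

def reverse_topological_order(root: nn.Module):
    # Pre-order DFS using an explicit stack; reversing the visit order yields
    # children before parents, with siblings kept in registration order.
    visited, stack = [], [("", root)]
    while stack:
        name, module = stack.pop()
        visited.append((name, module))
        for child_name, child in module.named_children():
            prefix = f"{name}.{child_name}" if name else child_name
            stack.append((prefix, child))
    return list(reversed(visited))
```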
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104427
Approved by: https://github.com/ezyang
Previously, we assumed that argnums is a **ConstantVariable**. However, I accidentally triggered an error on CI where argnums could be a **TupleVariable**. In that case, we get an attribute error when accessing the .value of argnums.
This PR adds support for the TupleVariable case. It allows the unit test to pass without falling back to eager:
"PYTORCH_TEST_WITH_DYNAMO=1 python test/functorch/test_eager_transforms.py -k test_argnums_cpu"
Test Plan:
see modified test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106425
Approved by: https://github.com/yanboliang, https://github.com/anijain2305, https://github.com/kshitij12345
We hope PyTorch's profiling parsing ability can also be applicable to custom devices. Based on previous work https://github.com/pytorch/pytorch/pull/101554, we have made supplementary updates to PyTorch profiling to extend its parsing capabilities for custom devices. These modifications do not affect the original logic of the code and mainly include the following aspects:
1. Added the relevant logic for use_device in torch.profiler.profiler._KinetoProfile.
2. In torch.autograd.profiler and torch.autograd.profiler_util, custom device profiling data parsing ability has been added using the privateuse1 and use_device attributes.
3. In torch._C._autograd.pyi, custom-device-related attributes have been added. The underlying C++
logic will be added in subsequent pull requests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106142
Approved by: https://github.com/aaronenyeshi
Summary:
Adding support for edge dialect ops in `exir/serde`. This diff does the following:
- Moves the global `serialize_operator/deserialize_operator` implementations in`export/serde/serialize.py` into `GraphModuleSerializer` and `GraphModuleDeserializer`
- Adds implementations of `serialize_operator/deserialize_operator` inside `GraphModuleSerializer` and `GraphModuleDeserializer` in `exir/serde/serialize.py`
Test Plan: CI + Enabled edge dialect ops in `executorch/exir/tests/test_serde.py`
Differential Revision: D47938280
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106371
Approved by: https://github.com/angelayi
This PR implements `try_solve`: a function that tries to move terms of a relational
expression around, so as to isolate a given variable on the left-hand side.
For example:
```python
>>> try_solve(Eq(a + 5, 3), a)
Eq(a, -2)
>>> try_solve(Gt(Mod(a, 3), 0), a) # returns None
>>> try_solve(Gt(Mod(a, 3), 0), Mod(a, 3))
Gt(Mod(a, 3), 0), Mod(a, 3)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105877
Approved by: https://github.com/ezyang
**Summary**
Re-enable the test cases `test_conv2d_binary_with_quantizer_api` and `test_conv2d_binary_unary_with_quantizer_api` for X86InductorQuantizer. We previously disabled these 2 test cases due to a timeout issue in internal CI.
**Test Plan**
```
python -m pytest test_x86inductor_quantizer.py -k test_conv2d_binary_with_quantizer_api
python -m pytest test_x86inductor_quantizer.py -k test_conv2d_binary_unary_with_quantizer_api
```
Differential Revision: [D47745372](https://our.internmc.facebook.com/intern/diff/D47745372)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105638
Approved by: https://github.com/jerryzh168, https://github.com/andrewor14
This PR should not make any functional difference. It:
- adds clearer documentation
- clarifies a type
- revises minor typos
- swaps a .keys for a .items call on a dictionary
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106069
Approved by: https://github.com/awgu
Freezing will take parameters and turn them into constants. A couple of changes here:
- move the setting of `flat_params[dropped_index]` before cpp compilation so that cpp_wrapper knows they have been dropped
- compile_fx_aot doesn't use aot_autograd for invocation, so we no longer add the wrapper which discards dropped param indices. Continuing to add arguments everywhere didn't seem great, so I added `_in_aot_compilation`, but maybe reviewers would prefer something else.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105497
Approved by: https://github.com/desertfire
When running on a host with multiple CPUs, the ufmt linter was not able to use them very effectively. The biggest single culprit seems to be debug logging inside blib2to3 trying to acquire a lock, but disabling that doesn't help much - I suppose this must be GIL contention. Changing to a ProcessPoolExecutor makes it much faster.
The following timings are on a PaperSpace GPU+ instance with 8 vCPUs (the cores show up as Intel(R) Xeon(R) CPU E5-2623 v4 @ 2.60GHz but I'm not entirely clear if those are shared with other instances).
On main:
```
$ time lintrunner --all-files --take UFMT
ok No lint issues.
real 7m46.140s
user 8m0.828s
sys 0m5.446s
```
On this branch:
```
$ time lintrunner --all-files --take UFMT
ok No lint issues.
real 1m7.255s
user 8m13.388s
sys 0m3.506s
```
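The gist of the change, as a hedged sketch (the real linter adapter's function names differ):
```python
from concurrent.futures import ProcessPoolExecutor  # previously a thread pool

def check_files(paths, check_file):
    # Formatting is CPU-bound, so threads mostly contend on the GIL;
    # separate processes let every vCPU do useful work.
    with ProcessPoolExecutor() as executor:
        return list(executor.map(check_file, paths))
```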
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106123
Approved by: https://github.com/ezyang
add `class FPGMStructured`
add `function FPGM_structured()`
add `function _validate_distance_type()`
add `function _compute_distance()`
Implement method mentioned in issue #39765
---
FPGMSparsifier is implemented with the new PyTorch pruning API torch.ao.pruning.
It is a structured pruning method, and it is added under torch.ao.pruning._experimental. Test cases are added at `test_structured_sparsifier.py`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95689
Approved by: https://github.com/jcaip
Summary:
This stack of PR's integrates cuSPARSELt into PyTorch.
This PR adds support for cuSPARSELt into the build process.
It adds a new flag, USE_CUSPARSELT, that defaults to false.
When USE_CUSPARSELT=1 is specified, the user can also specify
CUSPARSELT_ROOT, which defines the path to the library.
Compiling pytorch with cusparselt support can be done as follows:
```
USE_CUSPARSELT=1
CUSPARSELT_ROOT=/path/to/cusparselt
python setup.py develop
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103700
Approved by: https://github.com/albanD
Summary:
Added support to allow users to set configurations based on module type in XNNPACKQuantizer; this can also serve as an example
for implementing new quantizers.
Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_xnnpack_quantizer_set_module_type
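A hedged sketch of the intended usage (the import path and the config helper reflect the PT2E quantization APIs of this period and are assumptions):
```python
import torch
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)

quantizer = XNNPACKQuantizer()
quantizer.set_global(get_symmetric_quantization_config())
# Override the configuration for every nn.Linear module in the model
quantizer.set_module_type(torch.nn.Linear, get_symmetric_quantization_config(is_per_channel=True))
```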
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106094
Approved by: https://github.com/andrewor14
ghstack dependencies: #106087
In certain cases we capture ErrorMeta in a list. The ErrorMeta objects hold
tracebacks which contain a frame with a local variable that refers to that list.
This change mutates the list on exit from the frame so that it doesn't refer
to the ErrorMeta objects, breaking the cycle.
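A schematic sketch of the cycle and the fix (names here are illustrative, not the actual testing-framework code):
```python
def run_checks(checks):
    error_metas = []                 # this frame's local list of captured errors
    try:
        for check in checks:
            try:
                check()
            except AssertionError as exc:
                # exc.__traceback__ references this frame, whose locals include
                # error_metas, which in turn holds exc: a reference cycle.
                error_metas.append(exc)
        return list(error_metas)     # hand the caller a copy
    finally:
        # On frame exit, empty the list so the frames captured in the tracebacks
        # no longer reach the exception objects, breaking the cycle.
        error_metas.clear()
```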
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106328
Approved by: https://github.com/huydhn
This PR intends to extend Inductor to support the third-party backend that only focuses on the code generation just like what C++/OpenMP and Triton backend have done.
Currently, the code generated by Inductor contains two major parts. One is the kernel, and the other is the Python wrapper that glues the kernels together. Therefore, the third-party backend needs to customize both parts to generate its specific code.
- Python wrapper code generation
Inductor provides a `WrapperCodeGen` class to generate the Python wrapper code to glue the kernel. Therefore, it is straightforward for the third-party backend to generate the backend-specific Python wrapper code. It just needs to inherit the `WrapperCodeGen` class and purposely override the particular member functions.
- Kernel code generation
It is driven by different `Scheduling`. Hence, the third-party backend needs to provide a custom `Scheduling` for its specific kernel code generation. Currently, `CppScheduling` and `TritonScheduling` are for C++/OpenMP and Triton backend, respectively. But there is no common `Scheduling` class. Based on the scheduling invocation, this PR abstracts a common `Scheduling` class containing the following member functions.
- [group_fn](71c4becda7/torch/_inductor/scheduler.py (LL649C64-L649C64))
- [flush](71c4becda7/torch/_inductor/scheduler.py (L1150))
- [can_fuse_vertical](71c4becda7/torch/_inductor/scheduler.py (L1006))
- [can_fuse_horizontal](71c4becda7/torch/_inductor/scheduler.py (LL1008C45-L1008C64))
- [codegen_template](71c4becda7/torch/_inductor/scheduler.py (L1234)) _This function is only available for triton. If the third-party backend behaves as a sub-class of `TritonScheduling`, it can override it or reuse it._
- [codegen_nodes](71c4becda7/torch/_inductor/scheduler.py (L1234))
- [codegen_sync](71c4becda7/torch/_inductor/scheduler.py (LL1251C1-L1251C1)). _This function is only available for triton debug purpose. But it might also be useful for other computation devices. Therefore, we'd prefer to keep this function._
The third-party backend needs to inherit from the `Scheduling` class and implement these functions.
Regarding some other classes like `CppKernel` and `TritonKernel` for code generation, they are used by or part of the logic of either `Scheduling` or `WrapperCodeGen`. Hence, this PR does not define the interface and leaves the flexibility to the third-party backend. The third-party backend can decide to implement these classes from scratch or reuse them by inheriting and overriding them.
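Put together, a hedged sketch of what a third-party `Scheduling` subclass might look like, using only the member functions listed above (the base-class import path and the method bodies are assumptions):
```python
from torch._inductor.codegen.common import Scheduling  # import path assumed

class MyDeviceScheduling(Scheduling):
    """Kernel codegen hooks for a hypothetical 'my_device' backend."""

    def group_fn(self, sizes):
        return tuple(sizes)          # how iteration sizes are grouped into kernels

    def can_fuse_vertical(self, node1, node2):
        return False                 # start conservative: no producer/consumer fusion

    def can_fuse_horizontal(self, node1, node2):
        return False

    def codegen_nodes(self, nodes):
        ...                          # emit my_device kernel source for the fused nodes

    def flush(self):
        ...                          # finalize and register the generated kernels
```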
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100706
Approved by: https://github.com/jansel
Summary:
Building on Microsoft Visual Studio can show excessive warnings of the form:
```
caffe2\c10\util\Optional.h(212): warning C4624: 'c10::constexpr_storage_t<T>': destructor was implicitly defined as deleted
with
[
T=std::string
]
caffe2\c10\util\Optional.h(411): note: see reference to class template instantiation 'c10::constexpr_storage_t<T>' being compiled
with
[
T=std::string
]
caffe2\c10\util\Optional.h(549): note: see reference to class template instantiation 'c10::trivially_copyable_optimization_optional_base<T>' being compiled
with
[
T=std::string
]
```
While we have macros such as `C10_CLANG_DIAGNOSTIC_{PUSH,POP,IGNORE}`, there's no equivalent `C10_MSVC_DIAGNOSTIC_*`, so just do the suppressions explicitly.
Test Plan: CI should complete, but Windows build log will no longer contain C4624 warnings
Differential Revision: D47736268
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106348
Approved by: https://github.com/albanD
Summary:
Added support to allow users to set configurations based on module name in XNNPACKQuantizer; this can also serve as an example
for implementing new quantizers.
Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_xnnpack_quantizer_set_module_name
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106087
Approved by: https://github.com/andrewor14
Fixes #102375
Sequence_nr increments in the forward pass and decrements in the backward pass. Backward ops with the same sequence_nr as a forward op represent the backward implementation for that op. The long-term goal is to make this information available to the profiler so users can observe which ops are fused by the Inductor OpenAI Triton kernels.
Added a test for this feature: **test/dynamo/test_aot_autograd.py::AotAutogradFallbackTests::test_aot_sequence_nr**. The test case uses **aot_export_module()** to create a joint fwd/bwd fx graph. Then it walks all the nodes in the fx graph using fx_graph.graph.nodes. The seq_nr of each node is recorded in node.meta. During the fwd pass the seq_nr increments, and it decrements during the bwd pass. This allows the user to map forward ops to their corresponding bwd ops, which is useful for performance analysis.
Expected output from the test case.
SeqNr|OrigAten|SrcFn
0|aten.convolution.default|l__self___conv1
0|aten.add.Tensor|l__self___bn1
1|aten._native_batch_norm_legit_functional.default|l__self___bn1
2|aten.relu.default|l__self___relu1
3|aten.add.Tensor|add
4|aten.view.default|flatten
5|aten.t.default|l__self___fc1
6|aten.unsqueeze.default|l__self___fc1
7|aten.mm.default|l__self___fc1
8|aten.squeeze.dim|l__self___fc1
9|aten.add.Tensor|l__self___fc1
10|aten.sub.Tensor|l__self___loss_fn
11|aten.abs.default|l__self___loss_fn
12|aten.mean.default|l__self___loss_fn
12|aten.ones_like.default|
12|aten.expand.default|
12|aten.div.Scalar|
11|aten.sgn.default|
11|aten.mul.Tensor|
8|aten.unsqueeze.default|
7|aten.t.default|
7|aten.mm.default|
7|aten.t.default|
7|aten.t.default|
7|aten.mm.default|
6|aten.squeeze.dim|
5|aten.t.default|
4|aten.view.default|
2|aten.threshold_backward.default|
1|aten.native_batch_norm_backward.default|
0|aten.convolution_backward.default|
0|aten.add.Tensor|
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103129
Approved by: https://github.com/soulitzer
Summary:
Proactively fix it so we don't run into strange things in the future.
```
In [5]: cmd = '''gcc "single arg with space"'''
In [6]: print(cmd)
gcc "single arg with space"
In [7]: cmd.split(' ')
Out[7]: ['gcc', '"single', 'arg', 'with', 'space"']
In [8]: shlex.split(cmd)
Out[8]: ['gcc', 'single arg with space']
```
Test Plan: CI
Reviewed By: chenyang78
Differential Revision: D47532486
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105367
Approved by: https://github.com/chenyang78
Enabling miopen_batch_norm lowering for inductor only.
This is to avoid errors observed in some models, and the perf difference is very small based on initial benchmarks.
```
LoweringException: RuntimeError: Expected contiguous tensor, but got non-contiguous tensor for argument #1 'input' (while checking arguments for miopen_batch_norm)
target: aten.miopen_batch_norm.default
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105740
Approved by: https://github.com/jithunnair-amd, https://github.com/malfet
Related to #77764
Add support for the cumprod operation (which in turn allows its gradient). This also allows us to compute the gradient of prod since it was blocked behind cumprod in the case where exactly one element of the tensor was 0.
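A small usage example (requires a build where the MPS backend is available):
```python
import torch

x = torch.tensor([1.0, 2.0, 0.0, 4.0], device="mps", requires_grad=True)
y = x.cumprod(dim=0)     # cumprod now runs natively on MPS
x.prod().backward()      # prod's gradient works even with exactly one zero element
```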
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104688
Approved by: https://github.com/kulinseth
Previously calling _record_memory_history would only start recording
for a single device because snapshots were also device specific.
Now the visualizer packages all devices into a single page, so snapshot
recording should also enable recording for all devices.
Verified locally that calling the method does not initialize cuda context
on devices that have not previously been used.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106346
Approved by: https://github.com/eellison
### Proposal
When the 'keep_initializers_as_inputs' arg is True, it's quite possible that parameters are set through the initializers exposed as inputs.
Hence we should disable the de-duplicate-initializer optimization when 'keep_initializers_as_inputs==True'.
- [x] Update doc related to `keep_initializers_as_inputs`.
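For context, a minimal example of the export call in question:
```python
import torch

model = torch.nn.Linear(4, 4)
torch.onnx.export(
    model,
    (torch.randn(1, 4),),
    "linear.onnx",
    # Parameters stay exposed (and overridable) as graph inputs, so the
    # de-duplicate-initializer optimization should be skipped.
    keep_initializers_as_inputs=True,
)
```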
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96320
Approved by: https://github.com/abock, https://github.com/thiagocrepaldi
### 🤖 Generated by Copilot at bb1fc29
This pull request simplifies and refactors the code for fused scaled dot product attention kernels in `attention.cu` and `sdp_utils.cpp`, and adds new input validation checks and tests. It also modifies the `sdp_params` struct to store optional mask tensors directly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106102
Approved by: https://github.com/cpuhrsch
Fix: #105533
This PR propagates dynamic ints used as indices for `__setitem__`. In summary, we:
- Replace the integer type for `TensorIndex` (both the enum and the corresponding
functions)
- Accordingly modify _python_variable_indexing.cpp_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105923
Approved by: https://github.com/ezyang
Summary: Moving the static tracepoint macros header to a location where it can be easily used by various PyTorch components (`c10/util`).
Test Plan:
Same as for D47159249:
Tested the following macros on test scripts with libbpf USDTs:
* `CAFFE_SDT`
* `CAFFE_DISABLE_SDT`
* `CAFFE_SDT_WITH_SEMAPHORE`
Reviewed By: EDG-GH
Differential Revision: D47636258
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105856
Approved by: https://github.com/EDG-GH, https://github.com/chaekit
This may give the wrong result in some cases, e.g.
```python
@torch.compile()
def fn(x):
    tmp = x.ceil()
    x.add_(10)
    return tmp

a = torch.zeros((), dtype=torch.int64)
fn(a)  # tensor(10)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105173
Approved by: https://github.com/lezcano
Previously in `sample_inputs_linspace` the logic
```
dtype == torch.uint8 and end < 0 or start < 0
```
is equivalent to
```
(dtype == torch.uint8 and end < 0) or start < 0
```
which skipped all `start < 0` cases. I think this is unintended and the negative inputs should only be skipped when the dtype is `uint8`.
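Presumably the intended predicate groups the comparisons, e.g. (a hypothetical sketch mirroring the snippet above, not the exact code in the PR):
```python
import torch

def should_skip_sample(dtype, start, end):
    # Skip negative endpoints only for the unsigned dtype (uint8).
    return dtype == torch.uint8 and (end < 0 or start < 0)

assert should_skip_sample(torch.uint8, start=-1, end=5)
assert not should_skip_sample(torch.float32, start=-1, end=5)
```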
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106353
Approved by: https://github.com/BowenBao
The only thing we still deep-copy is the param_groups, which is much lighter weight. This should also save memory when loading from a checkpoint.
The deepcopy was introduced in ecfcf39f30, but module.py had only a shallow copy at that point so it did not actually bring parity.
Incorporates an XLA fix, which is why I'm updating the pin to ca5eab87a7
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106082
Approved by: https://github.com/albanD, https://github.com/Skylion007
Summary:
Place broadcast checks into `Broadcast.h` and `Broadcast.cpp` for code re-use.
Rename `check_inputs` to `is_broadcastable`
https://pytorch.org/docs/stable/notes/broadcasting.html
Test Plan:
All tests
https://www.internalfb.com/phabricator/paste/view/P797165124
```
QueryPool is not available
[ SKIPPED ] VulkanAPITest.querypool_flushed_shader_log (0 ms)
[----------] 318 tests from VulkanAPITest (8693 ms total)
[----------] Global test environment tear-down
[==========] 318 tests from 1 test suite ran. (8693 ms total)
[ PASSED ] 317 tests.
[ SKIPPED ] 1 test, listed below:
[ SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
```
Differential Revision: D47741937
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105960
Approved by: https://github.com/SS-JIA
We need to consider the node's offset when we create benchmark example tensors with test_cat_addmm. Otherwise, applying torch.as_strided to the returned tensor value would fail.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106238
Approved by: https://github.com/jansel
Summary:
Major changes:
* Implement a new group/batch fusion pattern searching algorithm: only fuse patterns that are in a certain depth difference (locally).
* Search the FX graph in reverse order since most ops have more inputs than outputs.
* Support fuse mm (linear backward)
* Preserve memory layout for fbgemm.gmm.
We tested in Ads models and saw consistent gains.
Test Plan: Unit tests and integration test.
Differential Revision: D47581710
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106279
Approved by: https://github.com/jansel, https://github.com/Skylion007
To `slow.yml` and `mac-mps.yml`, based on the results of the following grep:
```
% grep "sync-tag: " .github/workflows/*.yml
.github/workflows/mac-mps.yml: sync-tag: macos-12-py3-arm64-build
.github/workflows/mac-mps.yml: sync-tag: macos-12-py3-arm64-mps-test
.github/workflows/pull.yml: sync-tag: asan-build
.github/workflows/pull.yml: sync-tag: asan-test
.github/workflows/pull.yml: sync-tag: win-cpu-build
.github/workflows/pull.yml: sync-tag: rocm-build
.github/workflows/slow.yml: sync-tag: asan-build
.github/workflows/slow.yml: sync-tag: asan-test
.github/workflows/trunk.yml: sync-tag: macos-12-py3-arm64-build
.github/workflows/trunk.yml: sync-tag: macos-12-py3-arm64-mps-test
.github/workflows/trunk.yml: sync-tag: win-cpu-build
.github/workflows/trunk.yml: sync-tag: win-cuda-build
.github/workflows/trunk.yml: sync-tag: rocm-build
```
Allow synced workflows to diverge with regard to `test-matrix`, to allow for both `mac-mps` and the slow part of the ASAN tests.
Discovered while working on https://github.com/pytorch/pytorch/pull/105260 that the slow sync-tag is not checked.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106331
Approved by: https://github.com/huydhn, https://github.com/atalman, https://github.com/seemethere
### <samp>🤖 Generated by Copilot at 08bd685</samp>
Added a utility function `autograd_not_implemented_check` to `torch._higher_order_ops.utils` and used it in `out_dtype_autograd` to simplify and standardize the error handling for higher order operators that do not support autograd.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106078
Approved by: https://github.com/zou3519
Summary:
tldr:
* change glog -> cout for important logging inside aot_inductor.so
* bring a small amount of important python logging from debug to info
Test Plan: CI
Differential Revision: D47464665
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105366
Approved by: https://github.com/desertfire
`TensorMeta.from_irnodes` handles either a single `IRNode` or a tuple or list of them. I tried to express this with overloading, but because this file is in MYPYNOFOLLOW, the `IRNode` subclasses become `Any`, which causes the overloads to be overlapping.
This changes the type of the argument to `benchmark_in_sub_process` to the more specific `TritonTemplateCaller`, since that one has the `bmreq` member and existing docstrings indicate that only the triton template benchmark is handled.
The `rand_strided` call caused a mypy error because the default value for device was a string. This is fixed by adding type hints to `rand_strided` in `torch/_dynamo/testing.py`. Likewise, the return value of `PyCodeCache.load_by_key_path` can be inferred from the type hint on `PyCodeCache.cache`.
Fixes one part of #105230
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105791
Approved by: https://github.com/jansel, https://github.com/Skylion007
For ONNX MaxPool with ceil_mode=1, sliding windows that start in the right padded region are not ignored, which causes a different output shape than torch.
Therefore, we need to add a Pad op beforehand and not set ceil_mode for the MaxPool op, like what is done in symbolic_opset9 when converting torch max_pool to ONNX MaxPool.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106270
Approved by: https://github.com/thiagocrepaldi
- BatchLinearAlgebraLib.cpp is now split into one additional file
- BatchLinearAlgebraLib.cpp uses only cusolver APIs
- BatchLinearAlgebraLibBlas.cpp uses only cublas APIs
- hipify operates at the file level and cannot mix cusolver and cublas APIs within the same file
- cmake changes to link against hipblas instead of rocblas
- hipify mappings changes to map cublas -> hipblas instead of rocblas
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105881
Approved by: https://github.com/albanD
As node-12 EOLed a long time ago and is not available for Ubuntu-22.04 (discovered while working on the bionic deprecation).
Remove the artificial constraint on the gcc-10 downgrade (and some in-place patching) for Jammy, as CUDA-11.8+ works perfectly fine with gcc-11.
### <samp>🤖 Generated by Copilot at 6367120</samp>
> _`nodejs` version_
> _upgraded for security_
> _autumn leaves fall fast_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106312
Approved by: https://github.com/DanilBaibak, https://github.com/albanD, https://github.com/atalman
There are some cpp tests that did not run on the ROCm platform. This is part of the effort to enable them. Specifically, this change enables the distributed cpp tests.
Test plan:
Tested by using rocm/pytorch-nightly:latest image, and verified the distributed cpp tests PASSED locally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106132
Approved by: https://github.com/huydhn
Fixes https://github.com/pytorch/pytorch/issues/102970. See the comment [here](https://github.com/pytorch/pytorch/issues/102970#issuecomment-1577223773) for details.
We normally treat "outputs that alias inputs" specially in AOTAutograd, by replaying the views at runtime, instead of baking them into the graph. For views that are part of custom autograd functions though, we can't do that view-replay, since it will clobber the backwards function that the user specified in their custom autograd.Function.
Right now in this PR, I distinguish between "aliased inputs that are normal views" vs. "aliased inputs that are views that came from an autograd.Function call" by checking the output's `.grad_fn` field, to see if it inherits from our custom CBackward function class. Then I added a new `OutputType` enum value, that we effectively treat the "normal" way (the same way that we treat ordinary, non-aliased outputs). The new enum value is mostly for debugging, so we can print it and know that our graph had custom autograd.Function aliased outputs in it.
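A minimal eager-mode example of the situation being distinguished here (names are illustrative, not library code):
```python
import torch

class FlattenWithScaledGrad(torch.autograd.Function):
    # The forward output aliases the input, but the backward is user-defined,
    # so AOTAutograd must not replay the view (that would drop this backward).
    @staticmethod
    def forward(ctx, x):
        ctx.in_shape = x.shape
        return x.view(-1)

    @staticmethod
    def backward(ctx, grad_out):
        return (2 * grad_out).view(ctx.in_shape)

x = torch.randn(2, 3, requires_grad=True)
y = FlattenWithScaledGrad.apply(x)
print(y.grad_fn)  # the custom Function's backward node, not a plain ViewBackward
y.sum().backward()
print(x.grad)     # all 2s: the user-specified backward ran
```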
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102992
Approved by: https://github.com/ezyang, https://github.com/zou3519
This commit fixes a bug where some "If" nodes blocked shape inference during the onnx graph building.
In fixup_onnx_controlflow, a "Cast" node is added to conditions in "If" and "Loop" nodes if the condition type is not bool.
This commit performs shape inference on this new "Cast" node which allows its output to be marked as "reliable" in ConstantValueMap during further shape inference. This would have eventually happened when shape inference is performed on the entire graph, but the inferred shapes are also useful to have during onnx graph building, since it allows some ops (like Squeeze) to export into simpler subgraphs.
Also adds a test for this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106093
Approved by: https://github.com/thiagocrepaldi
@SherlockNoMad mentioned that it's not BC-safe to directly access these attributes, so I moved them to @property fields, and added a `@compatibility` decorator. For now I just set it to True for graph_module/graph.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106170
Approved by: https://github.com/SherlockNoMad
PR #101956 introduced additional stream priorities for cuda streams. HIP streams have slightly different semantics.
- HIP: 1=low, 0=default, -1=high
- CUDA: 0=default, -1=high, -2=higher, etc.
This PR forces HIP stream priority to just 0 and -1 to match the pytorch semantics.
This fixes a broken unit test.
```
python3 test_cuda_multigpu.py TestCudaMultiGPU.test_streams_priority -v
Test results will be stored in test-reports/python-unittest/test_cuda_multigpu
Running tests...
----------------------------------------------------------------------
test_streams_priority (__main__.TestCudaMultiGPU) ... ERROR (0.200s)
======================================================================
ERROR [0.200s]: test_streams_priority (__main__.TestCudaMultiGPU)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 2354, in wrapper
method(*args, **kwargs)
File "test_cuda_multigpu.py", line 656, in test_streams_priority
low, high = torch.cuda.Stream.priority_range()
RuntimeError: least_priority == 0 INTERNAL ASSERT FAILED at "/var/lib/jenkins/pytorch-upstream/c10/hip/HIPStream.h":184, please report a bug to PyTorch. Unexpected HIP stream priority range
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106157
Approved by: https://github.com/malfet
Summary: Calling `inspect.stack()` is surprisingly expensive when the stack is deep. We can instead just process the specific stack frame that's relevant -- it's much faster.
Test Plan:
```
import inspect
import sys
import time
def make_deep_stack(fn, n: int = 10):
    if n > 0:
        return make_deep_stack(fn, n - 1)
    return fn()

def full_stack():
    return inspect.stack()[1][3]

def via_current_frame():
    return inspect.getframeinfo(sys._getframe(1))[2]

start = time.perf_counter()
for _ in range(1000):
    make_deep_stack(full_stack)
print(f"full_stack took {time.perf_counter() - start}s")

start = time.perf_counter()
for _ in range(1000):
    make_deep_stack(via_current_frame)
print(f"via_current_frame took {time.perf_counter() - start}s")
> full_stack took 31.788201928138733s
> via_current_frame took 2.33455612603575s
```
Differential Revision: D47674015
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105940
Approved by: https://github.com/zou3519
Allow preferring cusolver as the linear algebra backend by setting TORCH_LINALG_PREFER_CUSOLVER=1.
This lets users prefer cusolver in their container use cases. The switch is not enabled by default, so it won't change any existing default behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106226
Approved by: https://github.com/lezcano
As the title says, add context support for custom devices, plus a test case.
In the future, we may want to refactor these hooks for different devices to unify the APIs; would you agree with this idea, @albanD?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105056
Approved by: https://github.com/albanD
* Enables PIE807 + PIE810. PIE807 is "do not reimplement the list builtin using a lambda" and PIE810 is "always fuse startswith/endswith calls" (I applied the autofixes for this before we had ruff enabled).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106218
Approved by: https://github.com/albanD
Summary:
This diff also adds more warning messages around allowing a namespace into the
fallback. We need to grandfather in an operator to actually merge this diff.
Test Plan: - existing tests
Differential Revision: D47873841
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106210
Approved by: https://github.com/eellison
Summary:
Add Vulkan support for [sum.dim_IntList](https://pytorch.org/docs/stable/generated/torch.sum.html) with `keepdim=true`
[sum.dim_IntList](https://www.internalfb.com/code/fbsource/[49b7951b7eb6]/xplat/caffe2/aten/src/ATen/native/native_functions.yaml?lines=5466)
```
if keepdim is true, the output tensor is of the same size as input except in the dimension(s) dim, where it is of size 1
otherwise, the dim is squeezed, result in the output tensor having 1 fewer dimension/s.
```
Test Plan:
```
lfq@lfq-mbp fbsource % buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 -- --gtest_filter="*.sum*"
Action graph will be rebuilt because files have been added or removed.
Parsing buck files: finished in 1.4 sec
Downloaded 4/58 artifacts, 3.08 Mbytes, 50.0% cache miss (for updated rules)
Building: finished in 41.2 sec (100%) 536/536 jobs, 13/536 updated
Total time: 42.8 sec
BUILD SUCCEEDED
Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *.sum*
[==========] Running 6 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 6 tests from VulkanAPITest
[ RUN ] VulkanAPITest.sum_dim_2d
[ OK ] VulkanAPITest.sum_dim_2d (558 ms)
[ RUN ] VulkanAPITest.sum_dim_3d
[ OK ] VulkanAPITest.sum_dim_3d (7 ms)
[ RUN ] VulkanAPITest.sum_dim_4d
[ OK ] VulkanAPITest.sum_dim_4d (14 ms)
[ RUN ] VulkanAPITest.sum_dim_keepdim_2d
[ OK ] VulkanAPITest.sum_dim_keepdim_2d (4 ms)
[ RUN ] VulkanAPITest.sum_dim_keepdim_3d
[ OK ] VulkanAPITest.sum_dim_keepdim_3d (7 ms)
[ RUN ] VulkanAPITest.sum_dim_keepdim_4d
[ OK ] VulkanAPITest.sum_dim_keepdim_4d (18 ms)
[----------] 6 tests from VulkanAPITest (612 ms total)
[----------] Global test environment tear-down
[==========] 6 tests from 1 test suite ran. (612 ms total)
[ PASSED ] 6 tests.
```
Reviewed By: SS-JIA
Differential Revision: D47652931
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106159
Approved by: https://github.com/SS-JIA
The current torch.compile docs have become a bit of a mess, with the docs expanded in the left nav. This PR moves them under the torch.compiler menu item in the left nav. A bunch of rewrites were made in collaboration with @msaroufim to address formatting issues, and the latest updates that moved some of the APIs to the public torch.compiler namespace were addressed as well. The documentation is broken down into three categories that address three main audiences: PyTorch users, PyTorch developers, and PyTorch backend vendors. While the user-facing documentation was significantly rewritten, the dev and vendor docs were kept mostly untouched. This can be addressed in follow-up PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105376
Approved by: https://github.com/msaroufim
Summary:
When in AOT mode, make use of the existing stream param:
- Pass through and use the stream param in the launchKernel helper function.
- In non-AOT mode, assign the stream param in the caller and pass to launchKernel
- Use a CUDAStreamGuard so all fallback ops execute on the stream
- CUDAStreamGuard subsumes CUDAGuard in AOT mode since it sets both stream and device
Test Plan:
- Ran cpp_wrapper tests: pytest test/inductor/test_cpp_wrapper.py
- Manually inspected cpp output from the alexnet benchmark:
a) In AOT mode:
```
static inline void launchKernel(
CUfunction func,
int gridX,
int gridY,
int gridZ,
int numWraps,
int sharedMemBytes,
cudaStream_t stream) {
AT_CUDA_DRIVER_CHECK_OVERRIDE(cuLaunchKernel(
func, gridX, gridY, gridZ, 32*numWraps, 1, 1, sharedMemBytes, stream, args, nullptr));
...
at::cuda::CUDAStreamGuard stream_guard(at::cuda::getStreamFromExternal(stream, 0));
...
launchKernel(triton_poi_fused_convolution_0, 1, 784, 1, 4, 4352, kernel_args_var_0, stream);
...
```
b) Regular cpp wrapper:
```
...
at::cuda::CUDAGuard device_guard(0);
cudaStream_t stream0 = at::cuda::getCurrentCUDAStream(0);
...
launchKernel(triton_poi_fused_convolution_0, 1, 784, 1, 4, 4352, kernel_args_var_0, stream0);
...
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105589
Approved by: https://github.com/desertfire
### Background: Gradient Pre-Divide
Consider $N$ data parallel workers. Define $g_i$ to be the $i$ th worker's local unsharded gradient. Data parallel gradient reduction computes $\overline g = \frac{1}{N} \sum_{i \in [N]} g_i$.
$\sum_{i \in [N]} g_i$ increases the magnitude by a factor of $N$, which may overflow for fp16. However, if we pre-divide and compute $\sum_{i \in [N]} \frac{g_i}{N}$, then the $\frac{g_i}{N}$ may underflow. The current solution from Myle for FSDP is to pre-divide by $\sqrt{N}$ and post-divide by $\sqrt{N}$:
$$\overline{g} = \frac{1}{\sqrt{N}} \sum_{i \in [N]} \frac{g_i}{\sqrt{N}}.$$
Now, consider HSDP with $N = S \cdot R$ data parallel workers, sharding over $S$ workers and replicating over $R$ workers. Define $g_{i,j}$ to be the $i \cdot S + j$ th worker's local unsharded gradient (so sharding indexes with $i$ and replication indexes with $j$). The existing implementation computes
$$\overline{g} = \frac{1}{\sqrt{R}} \sum_{j \in [R]} \textcolor{red}{ \frac{1}{\sqrt{R}} \frac{1}{\sqrt{S}} } \sum_{i \in [S]} \frac{g_{i,j}}{\sqrt{S}},$$
where the $\frac{1}{\sqrt{R}} \frac{1}{\sqrt{S}}$ involves two separate `aten::div_` kernels.
### Revisiting Pre-Divide for HSDP
A minor optimization that we can do is with this intermediate `div_`. There are two options:
1. Compute $\overline{g}$ in the same way as FSDP:
$$\overline{g} = \frac{1}{\sqrt{N}} \sum_{j \in [R]} \sum_{i \in [S]} \frac{g_{i,j}}{\sqrt{N}}.$$
2. Compute $\overline{g}$ still with an intermediate division for rescaling, but coalescing the two `div_`s into one:
$$\overline{g} = \frac{1}{\sqrt{R}} \sum_{j \in [R]} \textcolor{red}{ \frac{1}{\sqrt{N}} } \sum_{i \in [S]} \frac{g_{i,j}}{\sqrt{S}}$$
This PR goes with the 1st approach, prioritizing performance, because (1) it matches the existing FSDP behavior and (2) it avoids a memory-bandwidth-bound `div_` kernel that blocks the all-reduce launch.
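A small sketch of the chosen factors (an illustrative helper, not FSDP's actual internals):
```python
import math

def hsdp_divide_factors(shard_size: int, replicate_size: int):
    # Approach 1: pre-divide each local gradient by sqrt(N) before the
    # reduce-scatter and post-divide the all-reduced result by sqrt(N),
    # where N = S * R, matching the FSDP rescaling described above.
    world_size = shard_size * replicate_size
    factor = math.sqrt(world_size)
    return 1.0 / factor, 1.0 / factor  # (pre_divide, post_divide)

pre, post = hsdp_divide_factors(shard_size=8, replicate_size=4)
print(pre, post)
```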
### Implementation Details
In order to accommodate this, we need to refactor the communication hook logic that baked the gradient pre/post-division into the default hook.
- We raise an error if registering a communication hook for HSDP since the current implementation would only apply the hook to the reduce-scatter, not the all-reduce, which may be unexpected.
- We change it so that `state._comm_hook is not None` iff a communication hook is registered. This makes the collectives and the pre/post-division in the default no-communication-hook path more visible in the code.
Differential Revision: [D47852459](https://our.internmc.facebook.com/intern/diff/D47852459)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106034
Approved by: https://github.com/rohan-varma
In this pr, we allow users to register a customized flatten/unflatten/serialization/deserialization for a dataclass. We provide some default implementation for flatten/unflatten. We could implement a decorator based on it when needed.
## Motivation:
HuggingFace and many internal models return dataclass output and torch.export wants to maintain the invariant that export result (i.e. exported_program) has the same calling convention and result as the original callable.
This is not supported in export yet: we cannot recover the original dataclass from the flattened output produced by the underlying graph module (produced by dynamo and processed further by aot_export). We need a place to store the metadata of the dataclass so that we can reconstruct it. To avoid adding hacky code in export and to allow principled extensibility, we think extending pytree may be a good option.
## Implementation:
@zou3519 mentioned https://github.com/pytorch/pytorch/pull/93214/files and [jax-2371](https://github.com/google/jax/issues/2371#issuecomment-805361566), which suggests that it's not a good idea to make dataclass a default pytree node but it could be good to provide a default implementation for dataclass. Since currently, this seems to be an export-only feature, we added this extension point in export.
We also add "return_none_fields" flag to control whether none fields are returned after flattening, which is expected to be False in produce_matching of dynamo.export.
Also added some tests.
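For intuition, a hypothetical sketch of the kind of flatten/unflatten pair such a registration would supply (illustrative names only; the actual export extension point may differ):
```python
import dataclasses
from typing import Any, List, Tuple

@dataclasses.dataclass
class ModelOutput:
    logits: Any
    hidden: Any = None

def flatten_model_output(out: ModelOutput) -> Tuple[List[Any], Any]:
    # Returns (children, context); here the context is just the field names,
    # which is the metadata needed to reconstruct the dataclass later.
    fields = [f.name for f in dataclasses.fields(out)]
    return [getattr(out, name) for name in fields], fields

def unflatten_model_output(children: List[Any], context: Any) -> ModelOutput:
    return ModelOutput(**dict(zip(context, children)))

children, ctx = flatten_model_output(ModelOutput(logits=1, hidden=2))
assert unflatten_model_output(children, ctx) == ModelOutput(logits=1, hidden=2)
```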
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106160
Approved by: https://github.com/zhxchen17
* Encourage people to use -i instead of -f for mergebot
* Add additional info for when rebase fails due to lacking permissions
<details><summary>dryrun</summary>
````
csl@csl-mbp ~/zzzzzzzz/pytorch [csl/errormsgs] $
(forpytorch) python3 .github/scripts/tryrebase.py 106089 --branch viable/strict --dry-run
+ git -C /Users/csl/zzzzzzzz/pytorch rev-parse --verify refs/remotes/origin/viable/strict
@pytorchbot started a rebase job onto [refs/remotes/origin/viable/strict](7c97c943fb). Check the current status [here](None)
+ git -C /Users/csl/zzzzzzzz/pytorch fetch origin pull/106089/head:pull/106089/head
+ git -C /Users/csl/zzzzzzzz/pytorch rebase refs/remotes/origin/viable/strict pull/106089/head
+ git -C /Users/csl/zzzzzzzz/pytorch rev-parse --verify pull/106089/head
+ git -C /Users/csl/zzzzzzzz/pytorch rev-parse --verify refs/remotes/origin/viable/strict
+ git -C /Users/csl/zzzzzzzz/pytorch push --dry-run -f https://github.com/Lightning-Sandbox/pytorch.git pull/106089/head:fix/spaces
stdout:
remote: Permission to Lightning-Sandbox/pytorch.git denied to clee2000.
fatal: unable to access 'https://github.com/Lightning-Sandbox/pytorch.git/': The requested URL returned error: 403
stderr:
Rebase failed due to Command `git -C /Users/csl/zzzzzzzz/pytorch push --dry-run -f https://github.com/Lightning-Sandbox/pytorch.git pull/106089/head:fix/spaces` returned non-zero exit code 128
```
remote: Permission to Lightning-Sandbox/pytorch.git denied to clee2000.
fatal: unable to access 'https://github.com/Lightning-Sandbox/pytorch.git/': The requested URL returned error: 403
```
This is likely because the author did not allow edits from maintainers on the PR or because the repo has additional permissions settings that mergebot does not qualify.
````
</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106150
Approved by: https://github.com/huydhn
- Enabled LSTM weight prepack in inductor.
- Added a mkldnn decomposition for lstm which won't change for different `seq_lens`. With the previous decomposition, for dynamic shapes use case where `seq_lens` changes, the graph will be different.
- Extended several inductor utility functions to support `List(Tensor`) as input. Previously those functions only supported `Tensor` input.
**Update 2023-07-26:**
- https://github.com/pytorch/pytorch/pull/103851 has moved CPU weight packing to be after AOTAutograd. Fixed the support in this PR to follow the same way (mainly in 3b207f7f1c (diff-6dffed1ade0ba3e887f9a4eafa3bfcec267ab2365b8adcb91bd391f49b3fd2e3)).
LSTM is decomposed in `aten.mkldnn_rnn_layer` by layer and by direction. The weight prepack is done at the `mkldnn_rnn_layer` level.
- Add a fix in the rnn `__getstate__` function in case we need to recompile an `LSTM` module.
When compiling the module, the weights tensors which are the `named_parameters` of the module are converted to `functional_tensor` here:
76fb72e24a/torch/nn/utils/stateless.py (L125-L128)
The forward function of LSTM will be called:
76fb72e24a/torch/_functorch/aot_autograd.py (L3379-L3381)
In the forward function, the `_flat_weights` are updated to be the same as the weights, thus becoming `functional_tensor`:
76fb72e24a/torch/nn/modules/rnn.py (L775-L778)
The weights tensors are converted back to the original tensors (which are not `functional_tensor` anymore) before exiting the `_reparametrize_module` context here:
76fb72e24a/torch/nn/utils/stateless.py (L130-L142)
But since `_flat_weights` is not in the `named_parameters` of the module, it's still `functional_tensor` ([link of the parameters that will be converted to functional and reverted back](76fb72e24a/torch/_functorch/aot_autograd.py (L3695-L3698))).
At this moment, if we need to recompile the model, `deepcopy` will be called:
76fb72e24a/torch/_dynamo/utils.py (L915-L917)
And it will report `UnImplemented` since we have `functional_tensor` (`_flat_weights`) and will trigger graph break which is not what we expect:
76fb72e24a/torch/_subclasses/meta_utils.py (L514)
Added a fix in `__getstate__` to update the `_flat_weights` if the weights have ever changed, to fix this issue. The fix is covered in the `test_lstm_packed` UT.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103071
Approved by: https://github.com/jgong5, https://github.com/jansel
As described in
https://docs.google.com/document/d/1aGWtgxV3HppuxQAdddyPrs74_aEntpkYt9MalnCKnhk/edit
This PR changes the CustomOp API to be private and adds new public
wrappers around it so that the user does not need to know about the
"CustomOp" object. We've effectively changed the "CustomOp" object to be
some metadata about the operator that the user does not directly
interact with.
The "updated custom op API" is in torch._custom_ops. Pending good customer
feedback, we will promote this module to torch.custom_ops.
NB: I cannot move around the older torch._custom_op APIs yet because
people are already using them.
Test Plan:
- I changed all of our tests to use the new `torch._custom_ops` module
instead of the old CustomOp API.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105947
Approved by: https://github.com/soulitzer
This PR moves most custom op related tests from
test/test_python_dispatch.py to test/test_custom_ops.py. Motivation is
that I had a difficult time finding the custom op tests inside
test_python_dispatch.py.
This doesn't preserve blame, but it's OK - I'm the only person who has
really touched the moved tests so far :).
Test Plan:
- run tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106036
Approved by: https://github.com/bdhirsh, https://github.com/soulitzer
At the moment, we only record the list of pending and failed checks on Rockset merge records. This is enough to compute the force-merge KPI(s), but isn't enough for more in-depth analysis of what happened at the time of the merge:
* If the number of `ok_failed_checks` is less than `ok_failed_checks_threshold`, the list of `failed_checks` would (expectedly) be empty, so Rockset would only record an empty list.
* We support retries on PRs, so the classifications on Dr.CI could be different than what the dev observed at the time of the merge if a retry completed successfully.
### Testing
`python .github/scripts/trymerge.py --comment-id 1654010315 106095 --dry-run` (need to comment out some of the code to actually write a test record to Rockset), then manually verify it with
```
SELECT
*
FROM
commons.merges
WHERE
pr_num = 106095
```
to see that `ignore_current_checks`, `broken_trunk_checks`, `flaky_checks`, and `unstable_checks` shows up correctly
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106162
Approved by: https://github.com/clee2000
For the dynamic bfloat16 path, if we use a plain weight we can't hit the AMX path, so we use a dummy input (given a None value) to do the weight packing for better performance.
before:
```
onednn_verbose,exec,cpu,inner_product,x64:gemm:jit,forward_training,src_bf16::blocked:ab:f0 wei_bf16::blocked:ab:f0 bia_bf16::blocked:a:f0 dst_bf16::blocked:ab:f0,attr-scratchpad:user ,,mb64ic256oc256,9.4292
```
after:
```
onednn_verbose,exec,cpu,inner_product,brgemm:avx512_core_amx_bf16,forward_training,src_bf16::blocked:ab:f0 wei_bf16::blocked:AB16b32a2b:f0 bia_bf16::blocked:a:f0 dst_bf16::blocked:ab:f0,attr-scratchpad:user ,,mb64ic256oc256,0.35498
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106122
Approved by: https://github.com/jgong5, https://github.com/eellison
ops.bucketize() implements a binary search: it takes values and offsets; offsets defines a set of buckets, and ops.bucketize() returns, for each value, the index of the bucket it lies in. The op is elementwise with regard to the values and outputs, but it needs access to the entire offsets tensor in global memory so that it can perform the binary search. So, we need to realize the boundaries into global memory before running the op. The scheduler won't try to fuse the two kernels together because the input to ops.bucketize() is marked as a StarDep.
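For reference, a small illustration of these semantics using the eager op torch.bucketize rather than the inductor lowering:
```python
import torch

boundaries = torch.tensor([1.0, 3.0, 5.0, 7.0])  # must be fully materialized
values = torch.tensor([0.5, 2.0, 6.2, 9.0])
# Each value is mapped to the index of the bucket it falls into via binary search.
print(torch.bucketize(values, boundaries))  # tensor([0, 1, 3, 4])
```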
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106107
Approved by: https://github.com/jansel
No need to wait if the job classification is unstable as it would be ignored anyway. This is useful to not need to wait for scarce resources like ROCm, which is also frequently in unstable mode (There is a ROCm queue atm)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106095
Approved by: https://github.com/clee2000
Prior to this PR, complex dtypes would simply fail. This PR keeps complex dtypes in the torch.fx.Graph, while mapping them to float dtypes in the TorchScript (ONNX) graph with a real representation.
The change happens in multiple files:
1. `placeholder`: Apply torch.view_as_real() before sending fake tensor to graph building.
2. `call_function`: Fill in TorchScriptTensor dtype and shape with real representation dtype and shape.
3. Registry: Add `is_complex`, and supports complex onnxfunction.
4. Dispatcher: Filter complex onnxfunctions in/out before opschema matching, based on the dtype in the torch args.
5. Test cases: input/output view_as_real for result comparisons.
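A quick illustration of the real representation used above (torch.view_as_real is an existing eager op):
```python
import torch

z = torch.tensor([1 + 2j, 3 - 4j])
r = torch.view_as_real(z)  # real view with a trailing (real, imag) dimension of size 2
print(r)        # tensor([[ 1.,  2.], [ 3., -4.]])
print(r.shape)  # torch.Size([2, 2])
```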
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100554
Approved by: https://github.com/BowenBao
We want to display the stack for the original cudaMalloc that created a segment.
Previously we could only report the last time the segment memory was used,
or the record of the segment_alloc could appear in the list of allocator actions.
This PR ensures that regardless of whether we still have the segment_alloc action,
the context for a segment is still available. The visualizer is updated to
be able to incorporate this information.
This PR adds a new field to Block. However the previous stacked cleanup PR
removed a field of the same size, making the change to Block size-neutral.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106113
Approved by: https://github.com/aaronenyeshi
For free blocks of memory in the allocator, we previously kept a linked list
of the stack frames of previous allocations that lived there. This was only
ever used in one flamegraph visualization and never proved useful at
understanding what was going on. When memory history tracing was added, it
became redundant, since we can see the history of the free space from recording
the previous actions anyway.
This patch removes this functionality and simplifies the snapshot format:
allocated blocks directly have a 'frames' attribute rather than burying stack frames in the history.
Previously the memory history tracked the real size of allocations before rounding.
Since history was added, 'requested_size' has been added directly to the block which records the same information,
so this patch also removes that redundancy.
None of this functionality has been part of a PyTorch release with BC guarantees, so it should be safe to alter
this part of the format.
This patch also updates our visualization tools to work with the simplified format. Visualization tools keep
support for the old format in `_legacy` functions so that during the transition old snapshot files can still be read.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106079
Approved by: https://github.com/eellison
### Description
Hi! We've been fuzzing `pytorch` with [sydr-fuzz](https://github.com/ispras/oss-sydr-fuzz) and found an out-of-bounds access error in the `torch::jit` module.
pytorch version: 18bcf62bbcf7ffd47e3bcf2596f72aa07a07d65f
The error occurs in `import_source.cpp:560` when we get the type from `assign.rhs()`. Both `assign.rhs()` and `assign.type()` have `Maybe` type, so either of them can be absent. According to the [grammar](22f93852a2/torch/csrc/jit/frontend/tree_views.h), we can have an `Assign` statement whose `lhs` is a `Subscript`, whose `rhs` is empty (a `Maybe` with no subtrees), and whose `type` is present. But in `import_source.cpp:560` we try to get the `rhs` expression from the assignment with no check whether it is present.
This is example from the how to reproduce section from the testing input:
```
class Module(Module):
__parameters__ = ["0", ]
__buffers__ = []
__annotations__ = []
__annotations__["0"] : Tensor
```
When we parse the last statement of class definition, we set the type of `lhs` to `Subscript`, because the lookahead is `[`
76fb72e24a/torch/csrc/jit/frontend/parser.cpp (L205-L207)
Then in `parseAssignment` we get `maybeOp` and `type` depending on the next symbol (if it is `:`, we get only the type)
76fb72e24a/torch/csrc/jit/frontend/parser.cpp (L437-L447)
So after that, in `import_source.cpp:560`, while parsing attributes, one of which is an assignment whose `lhs` is a `Subscript`, we try to get the type from the `rhs` expression and an out-of-bounds access occurs.
To fix the error, we need to check whether `rhs` or `type` is present and get the type from the corresponding expression.
### How to reproduce
Build docker container from [here](https://github.com/ispras/oss-sydr-fuzz/tree/master/projects/pytorch):
```bash
$ sudo docker build -t oss-sydr-fuzz-pytorch .
```
Run docker container:
```bash
$ sudo docker run --rm --privileged -v `pwd`:/fuzz -it oss-sydr-fuzz-pytorch /bin/bash
```
Run the `load_fuzz` target on the [input.txt](https://github.com/pytorch/pytorch/files/12173962/input.txt)
```bash
/load_fuzz input.txt
```
You will see the following output:
```
AddressSanitizer:DEADLYSIGNAL
=================================================================
==157==ERROR: AddressSanitizer: SEGV on unknown address (pc 0x00000c163764 bp 0x7ffee71d0070 sp 0x7ffee71d0050 T0)
==157==The signal is caused by a READ memory access.
==157==Hint: this fault was caused by a dereference of a high value address (see register values below). Disassemble the provided pc to learn which register was used.
#0 0xc163764 in c10::intrusive_ptr<torch::jit::Tree, c10::detail::intrusive_target_default_null_type<torch::jit::Tree> >::retain_() /pytorch/c10/util/intrusive_ptr.h:265:54
#1 0xc1697fd in c10::intrusive_ptr<torch::jit::Tree, c10::detail::intrusive_target_default_null_type<torch::jit::Tree> >::intrusive_ptr(c10::intrusive_ptr<torch::jit::Tree, c10::detail::intrusive_target_default_null_type<torch::jit::Tree> > const&) /pytorch/c10/util/intrusive_ptr.h:354:5
#2 0xc1697fd in torch::jit::Expr::Expr(c10::intrusive_ptr<torch::jit::Tree, c10::detail::intrusive_target_default_null_type<torch::jit::Tree> > const&) /pytorch/torch/csrc/jit/frontend/tree_views.h:270:49
#3 0xc1f02cb in torch::jit::Maybe<torch::jit::Expr>::get() const /pytorch/torch/csrc/jit/frontend/tree_views.h:212:12
#4 0xd194369 in torch::jit::SourceImporterImpl::importClass(c10::QualifiedName const&, torch::jit::ClassDef const&, bool) /pytorch/torch/csrc/jit/serialization/import_source.cpp:560:70
#5 0xd18c701 in torch::jit::SourceImporterImpl::importNamedType(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, torch::jit::ClassDef const&) /pytorch/torch/csrc/jit/serialization/import_source.cpp:288:5
#6 0xd18a84c in torch::jit::SourceImporterImpl::findNamedType(c10::QualifiedName const&) /pytorch/torch/csrc/jit/serialization/import_source.cpp:140:5
#7 0xd1913a8 in torch::jit::SourceImporterImpl::resolveType(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, torch::jit::SourceRange const&) /pytorch/torch/csrc/jit/serialization/import_source.cpp:261:10
#8 0xc2e422f in torch::jit::ScriptTypeParser::parseTypeFromExpr(torch::jit::Expr const&) const /pytorch/torch/csrc/jit/frontend/script_type_parser.cpp:238:24
#9 0xc2e4697 in torch::jit::ScriptTypeParser::parseType(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) /pytorch/torch/csrc/jit/frontend/script_type_parser.cpp:312:10
#10 0xd1a37d4 in torch::jit::SourceImporter::loadType(c10::QualifiedName const&) const /pytorch/torch/csrc/jit/serialization/import_source.cpp:786:27
#11 0xd121c47 in torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::$_0::operator()(c10::QualifiedName const&) const /pytorch/torch/csrc/jit/serialization/import.cpp:146:33
#12 0xd121c47 in c10::StrongTypePtr std::__invoke_impl<c10::StrongTypePtr, torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::$_0&, c10::QualifiedName const&>(std::__invoke_other, torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::$_0&, c10::QualifiedName const&) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/invoke.h:60:14
#13 0xd121ad0 in std::enable_if<is_invocable_r_v<c10::StrongTypePtr, torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::$_0&, c10::QualifiedName const&>, c10::StrongTypePtr>::type std::__invoke_r<c10::StrongTypePtr, torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::$_0&, c10::QualifiedName const&>(torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::$_0&, c10::QualifiedName const&) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/invoke.h:113:9
#14 0xd121926 in std::_Function_handler<c10::StrongTypePtr (c10::QualifiedName const&), torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::$_0>::_M_invoke(std::_Any_data const&, c10::QualifiedName const&) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/std_function.h:291:9
#15 0xd17ec49 in std::function<c10::StrongTypePtr (c10::QualifiedName const&)>::operator()(c10::QualifiedName const&) const /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/std_function.h:622:14
#16 0xd26b802 in torch::jit::Unpickler::readGlobal(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) /pytorch/torch/csrc/jit/serialization/unpickler.cpp:844:9
#17 0xd2615fb in torch::jit::Unpickler::readInstruction() /pytorch/torch/csrc/jit/serialization/unpickler.cpp:520:7
#18 0xd25f917 in torch::jit::Unpickler::run() /pytorch/torch/csrc/jit/serialization/unpickler.cpp:253:27
#19 0xd25f5b2 in torch::jit::Unpickler::parse_ivalue() /pytorch/torch/csrc/jit/serialization/unpickler.cpp:206:3
#20 0xd186403 in torch::jit::readArchiveAndTensors(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<std::function<c10::StrongTypePtr (c10::QualifiedName const&)> >, c10::optional<std::function<c10::intrusive_ptr<c10::ivalue::Object, c10::detail::intrusive_target_default_null_type<c10::ivalue::Object> > (c10::StrongTypePtr, c10::IValue)> >, c10::optional<c10::Device>, caffe2::serialize::PyTorchStreamReader&, c10::Type::SingletonOrSharedTypePtr<c10::Type> (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&), std::shared_ptr<torch::jit::DeserializationStorageContext>) /pytorch/torch/csrc/jit/serialization/import_read.cpp:53:20
#21 0xd12152d in torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) /pytorch/torch/csrc/jit/serialization/import.cpp:184:10
#22 0xd117bae in torch::jit::(anonymous namespace)::ScriptModuleDeserializer::deserialize(c10::optional<c10::Device>, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&, bool) /pytorch/torch/csrc/jit/serialization/import.cpp:287:19
#23 0xd114074 in torch::jit::import_ir_module(std::shared_ptr<torch::jit::CompilationUnit>, std::istream&, c10::optional<c10::Device>, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&, bool, bool) /pytorch/torch/csrc/jit/serialization/import.cpp:389:25
#24 0xd113a27 in torch::jit::import_ir_module(std::shared_ptr<torch::jit::CompilationUnit>, std::istream&, c10::optional<c10::Device>, bool) /pytorch/torch/csrc/jit/serialization/import.cpp:325:10
#25 0xd11bb64 in torch::jit::load(std::istream&, c10::optional<c10::Device>, bool) /pytorch/torch/csrc/jit/serialization/import.cpp:485:10
#26 0x610c5c in LLVMFuzzerTestOneInput /load.cc:42:14
#27 0x537701 in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:611:15
#28 0x52160c in fuzzer::RunOneTest(fuzzer::Fuzzer*, char const*, unsigned long) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:324:6
#29 0x52735b in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:860:9
#30 0x550912 in main /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerMain.cpp:20:10
#31 0x7f06e8323082 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24082) (BuildId: 1878e6b475720c7c51969e69ab2d276fae6d1dee)
#32 0x51bf2d in _start (/load_fuzz+0x51bf2d)
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /pytorch/c10/util/intrusive_ptr.h:265:54 in c10::intrusive_ptr<torch::jit::Tree, c10::detail::intrusive_target_default_null_type<torch::jit::Tree> >::retain_()
==157==ABORTING
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106041
Approved by: https://github.com/davidberard98
Previously, we would have test failures for operators which graph-broke because of dynamic shapes or data-dependent ops. Those would appear as failures because we were running with `nopython=True`. Those test "failures" (which are expected behavior) obfuscated the actual correctness errors and made this test lower signal.
If we wanted to do something like full-op export, that should be different than inductor opinfos.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105480
Approved by: https://github.com/desertfire
Previously, when fusing a single node into a foreach op, the scheduler would iterate over each subnode and check if it could be fused. This PR adds a mapping so that the node to fuse with can be found more quickly by checking dependencies.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106008
Approved by: https://github.com/jansel
Summary:
Add Vulkan support for [sum.dim_IntList](https://pytorch.org/docs/stable/generated/torch.sum.html)
[sum.dim_IntList](https://www.internalfb.com/code/fbsource/[49b7951b7eb6]/xplat/caffe2/aten/src/ATen/native/native_functions.yaml?lines=5466):
```
func: sum.dim_IntList(Tensor self, int[1]? dim, bool keepdim=False, *, ScalarType? dtype=None)
```
Some explanation: for each pos,
- Iterate over the out_texel and the summed dimension.
- For H,W: rearrange pos.x, pos.y.
- For C,H,W: when C,H,W are summed, batch moves into channel; the src N is determined by pos.z * 4 + out_index.
Follow up:
Add support for `keepdim=true`
```
if keepdim is true, the output tensor is of the same size as input except in the dimension(s) dim, where it is of size 1
otherwise, the dim is squeezed, result in the output tensor having 1 fewer dimension/s.
```
Add support for [sum](https://www.internalfb.com/code/fbsource/[49b7951b7eb6]/xplat/caffe2/aten/src/ATen/native/native_functions.yaml?lines=5457)
```
func: sum(Tensor self, *, ScalarType? dtype=None) -> Tensor
```
Test Plan:
New tests:
```
lfq@lfq-mbp fbsource % buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 -- --gtest_filter="*.sum*"
Downloaded 0/53 artifacts, 0.00 bytes, 100.0% cache miss (for updated rules)
Building: finished in 47.4 sec (100%) 536/536 jobs, 8/536 updated
Total time: 47.5 sec
BUILD SUCCEEDED
Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *.sum*
[==========] Running 5 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 5 tests from VulkanAPITest
[ RUN ] VulkanAPITest.sum_2d
[ OK ] VulkanAPITest.sum_2d (426 ms)
[ RUN ] VulkanAPITest.sum_3d
[ OK ] VulkanAPITest.sum_3d (2 ms)
[ RUN ] VulkanAPITest.sum_4d
[ OK ] VulkanAPITest.sum_4d (3 ms)
[ RUN ] VulkanAPITest.sum_3d_combined
[ OK ] VulkanAPITest.sum_3d_combined (1 ms)
[ RUN ] VulkanAPITest.sum_4d_combined
[ OK ] VulkanAPITest.sum_4d_combined (5 ms)
[----------] 5 tests from VulkanAPITest (437 ms total)
[----------] Global test environment tear-down
[==========] 5 tests from 1 test suite ran. (438 ms total)
[ PASSED ] 5 tests.
```
clang-format on Sum.cpp and sum_dim.glsl
Differential Revision: D47580428
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105612
Approved by: https://github.com/SS-JIA
For the CPU backend, we always use channels_last to get good performance by avoiding format reorders (block to plain or plain to block). The weight packing also assumes that the weight is channels_last, so we always convert the weight format and do the layout optimization.
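A minimal eager-mode illustration of the channels_last convention assumed here:
```python
import torch

# Convert both the module (weights) and the input activations to channels_last.
conv = torch.nn.Conv2d(3, 8, kernel_size=3).to(memory_format=torch.channels_last)
x = torch.randn(1, 3, 32, 32).to(memory_format=torch.channels_last)
out = conv(x)
print(out.is_contiguous(memory_format=torch.channels_last))  # True
```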
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105517
Approved by: https://github.com/jgong5, https://github.com/shunting314
Summary:
Fix this warning:
```
caffe2\c10\macros\Macros.h(138): warning C4067: unexpected tokens following preprocessor directive - expected a newline
```
`caffe2/c10/util/variant.h` already has a similar check to define a stub for `__has_attribute(x)`, so this would not be new to caffe2/pytorch.
Test Plan: CI should complete, still with plenty of caffe2 warnings but this one should be gone from the Windows build log
Differential Revision: D47735319
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105922
Approved by: https://github.com/kit1980
Fix: #105074
This PR makes dynamo handle NumPy global variables the same way as PyTorch tensor global variables, by tracking them as side effects.
In summary, we add `NumpyNdarrayVariable` to the
`VariableBuilder._can_lift_attrs_to_inputs` function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105959
Approved by: https://github.com/ezyang
Given the number of unstable jobs atm (rocm, distributed), the limit of 3 for ignorable failures is too low. When I manually looked into force merges, I could find many examples like https://github.com/pytorch/pytorch/pull/105848 where there are 3+ unrelated failures. As the classification is getting more accurate, we can aim to ignore all flaky and broken-trunk failures.
* Default `ok_failed_checks_threshold` to `-1` to ignore all unrelated failures
* Increase the `IGNORABLE_FAILED_CHECKS_THESHOLD` to 10. The only concern I have before setting it to `-1` is the fog of war situation when a sev occurs. So 10 is a good middle ground before we agree to set it to `-1`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105998
Approved by: https://github.com/clee2000
In a recent change, diagnostics started logging the contents of tuples/lists/dicts for diagnosed function arguments and return types. This slowed down export due to some extremely large container instances, such as the fx-to-onnx node mapping dictionary.
This PR adds a limit to how many elements the diagnostic would record for
these types. Together with https://github.com/microsoft/onnxscript/pull/922, the performance of
export w/ diagnostics is restored and improved. As shown by pyinstrument:
GPT2 time for `fx_to_onnx_interpreter.run` 17.767s -> 1.961s
xcit_large_24_p8_224 time for `fx_to_onnx_interpreter.run` 144.729s -> 4.067s
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106048
Approved by: https://github.com/titaiwangms, https://github.com/justinchuby
# Summary
The vast majority of tests here only run on CUDA. Decorating with @onlyCUDA causes pytest to instantiate 2x the tests and skip half of them. This overhead is non-trivial when the number of tests grows large, as it has for this file.
This breaks up the cuda only tests into a separate class
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105938
Approved by: https://github.com/mikaylagawarecki, https://github.com/malfet
Fix issue where we were testing `test_schema_correctness_nn_functional_scaled_dot_product_attention_cuda_bfloat16` from `test_schema_check.py` on V100, but bfloat16 support on cuda doesn't exist for sm < 80. Added skip if sm < 80 to the failing test. cc @ptrblck @eqy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105888
Approved by: https://github.com/kit1980
Summary: moving quantizer to torch.ao.quantization to make it a public api, since pt2e is a folder for implementations
Test Plan:
CIs
sanity check: "buck test //executorch/backends/xnnpack/test:test_xnnpack_quantized_models -- test_resnet18"
Differential Revision: D47727838
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105885
Approved by: https://github.com/andrewor14
# Summary
### Review Points
- Automatically pad tensors to create aligned masks when seqlen_kv is not a multiple of 16 (see the illustration after this list). This will cause a memory spike of ~2x the attn_mask size, which could in theory be big. It appears though that doing this + mem_eff is faster than no_pad + math, so it seems to be worth it.
- Using expand to view the attn_mask in 4d. This is a little different from how we enforce q,k,v to be viewed in 4d prior to calling. Also not supporting the b*n_heads, seq_lenq, seq_lenkv case.
- Should enable #96099.
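As a quick illustration of the broadcastable bias shape mentioned in the first review point (backend selection depends on device/dtype; CPU falls back to the math path):
```python
import torch
import torch.nn.functional as F

# Shapes are (batch, n_heads, seq_len, head_dim); the additive bias broadcasts
# across batch and heads via its (1, 1, seqlen_q, seqlen_kv) shape.
q = torch.randn(8, 32, 128, 64)
k = torch.randn(8, 32, 128, 64)
v = torch.randn(8, 32, 128, 64)
bias = torch.randn(1, 1, 128, 128)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=bias)
print(out.shape)  # torch.Size([8, 32, 128, 64])
```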
### Profiling
I ran a bunch of comparisons between sdpa.MATH and sdp.MemEffAttention. I added an attn_bias of shape (1, 1, seqlen_q, seqlen_k). For these experiments seqlen_q == seqlen_k. These were all run on an A100 80GB GPU.
Configs:
```
# Run a bunch of experiments
batch_sizes = [8, 16, 32]
num_heads = [16, 32]
max_seq_lens = [15, 64, 128, 512, 555, 1024]
embed_dims = [32, 64, 128]
dtypes = [torch.float16, torch.bfloat16, torch.float32]
pad_percentages = [None]
backends = [SDPBackend.EFFICIENT_ATTENTION, SDPBackend.MATH]
run_backward = True
attn_mask = True
```
The function calls `sdpa(input**).sum().backward()`.
I calculated the geomean speedup of the efficient attention path of the math path for all these configs:
`Geomean Speedup: 1.977`
An example comparison with batchsize = 8, num_heads = 32, embed_dim = 64, and dtype = torch.float16:

This was done using the current state of the branch where we force alignment of mask when the last dim is not divisible by 16, which shows up in seq_len = 15 and 555 case.
The full data can be found here:
[attn_mask_sweep.csv](https://github.com/pytorch/pytorch/files/11962399/attn_mask_sweep.csv)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104310
Approved by: https://github.com/cpuhrsch
As per title.
Note that the c++ side code for the minidumps part was removed. So trying to call any of these 3 functions today results in an error saying that `torch._C` doesn't have these attributes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105142
Approved by: https://github.com/janeyx99
Summary:
The old error message shows
```
... add `c10::InferenceMode mode;` before model.forward(). Note this guard is only available in C++ but not Python at present."
```
However, InferenceMode for Python has been enabled since D28390595. It can be used as a context manager with `torch.inference_mode()`. The error message is fixed accordingly.
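For reference, the Python-side usage looks like this (a minimal example):
```python
import torch

model = torch.nn.Linear(4, 2)
x = torch.randn(1, 4)

# InferenceMode is available from Python as a context manager (and decorator).
with torch.inference_mode():
    out = model(x)

print(out.requires_grad)  # False: outputs computed under inference mode
```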
Test Plan: Easy
Reviewed By: yipjustin
Differential Revision: D47655392
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105948
Approved by: https://github.com/albanD
Without this PR, the warning message is misleading, as it says the default is found right before the error message pops up.
Next PR will start refactoring aten overload fallback with adding overloads supported by torchlib into OpSchema matching.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105972
Approved by: https://github.com/BowenBao
Summary:
## About Sync Events
For CUDA profiling mode, we can enable tracing CUDA synchronization events.
* This feature captures synchronization events in CUDA including 1) context/device sync, 2) stream sync, 3) CUDA event sync, 4) CUDA stream wait event (inter-stream synchronization).
* We add this flag using the profiler's experimental config option.
* This PR relies on 7b003638c6 change in pytorch/kineto
## Usage
Just set the `enable_cuda_sync_events` option in `_ExperimentalConfig`
```
from torch.autograd.profiler import profile, _ExperimentalConfig

with profile(
    use_kineto=True,
    use_cuda=True,
    experimental_config=_ExperimentalConfig(enable_cuda_sync_events=True),
) as prof:
    workload()
```
**Please wait for PyTorch github repo to point to 7b003638c6 or later commit in Kineto**
Test Plan:
## Unit Test
Added a unit test
buck2 test mode/dev-nosan caffe2/test:profiler --local-only -- test_profiler_cuda_sync_events
Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. Build failure 0
https://www.internalfb.com/intern/testinfra/testrun/281475298097379
Reviewed By: davidberard98
Differential Revision: D46244591
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105187
Approved by: https://github.com/aaronenyeshi
In some situations we were registering a hook on a Tensor that does not
require grad, which immediately raises an error. This PR fixes that by
skipping the hook registration if the Tensor in question does not
require grad.
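A minimal sketch of the failure mode and of the guarded registration (the tensor and lambda are illustrative):
```python
import torch

t = torch.randn(3)  # requires_grad is False by default

# Registering a hook on a tensor that does not require grad raises immediately.
try:
    t.register_hook(lambda grad: grad)
except RuntimeError as e:
    print(e)

# The guarded pattern: only register the hook when it can actually fire.
if t.requires_grad:
    t.register_hook(lambda grad: grad)
```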
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105660
Approved by: https://github.com/soulitzer
In fbcode, to run the test python script (with its accompanying test DSO) we
need to invoke the correct python, with the correct PYTHONPATH, so we supply
those by reading the appropriate values out of `sys`.
It's an improvement for OSS too, since the user may not be running the default
python.
My previous attempt of using `torch.backends.cpu.get_cpu_capability()` didn't work out, for two reasons:
1. That function actually refuses to report AVX512 support; it's #ifdef-ed out, for some reason.
2. In CI, we apparently are picking INVALID_VEC_ISA (at least when running
inductor_timm_cpu_accuracy), whereas `get_cpu_capability` reports AVX2. This
is surprising, and probably indicates a bug (either in cpu capability or our
test binary), but I'd rather not go digging for it.
Differential Revision: [D47678649](https://our.internmc.facebook.com/intern/diff/D47678649/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105756
Approved by: https://github.com/jansel, https://github.com/mikekgfb
Prior to this PR, op_level_debug didn't support OnnxFakeContext because it relied on real tensors in args to run shape/type inference propagation over the fx graph and obtain the static shapes used to simulate the op args/kwargs. However, OnnxFakeContext fakifies the args/kwargs at the very beginning, so op_level_debug has no static shapes to utilize.
This PR uses SymInt API: `has_hint` and `hint_int` to fully replace the functionality of shape type inference propagation. The static shapes are obtained through SymInt. Therefore, the pass `ShapeInferenceWithFakeTensor` is eliminated.
Also moved the args/kwargs processing into op_validation to live under the rule `op_level_debug`.
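A minimal sketch of how the two SymInt APIs can recover a static size during fake tracing (the helper name is hypothetical, not the code in this PR):
```python
from torch.fx.experimental.symbolic_shapes import has_hint, hint_int

def static_size_or_none(dim):
    # `dim` may be a plain int or a SymInt produced under fake tensor tracing.
    if has_hint(dim):
        return hint_int(dim)  # concrete value recorded for this symbol
    return None               # genuinely dynamic: no static shape to fall back on
```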
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105874
Approved by: https://github.com/thiagocrepaldi, https://github.com/BowenBao
With distributed checkpointing in PyTorch/XLA SPMD, the WriteItem index hints should not be modified when creating the global plan. In order to reuse the default planner logic for checkpoint metadata creation, we need to make the behavior of rewriting index hints optional.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105861
Approved by: https://github.com/kumpera
Summary: Sometimes the graph that is being serialized contains nodes with side effects + no users (ex. out variants of operators), so we don't want to eliminate those when deserializing.
Test Plan: CI
Differential Revision: D47735009
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105875
Approved by: https://github.com/ydwu4
The current test case produces an edge-case tensor input that causes a single generated tensor to fail the tolerance assertion, on ROCm only and only for float32. We have reviewed the logic with our libraries team and have discovered that the discrepancy is due to a difference in the order of operations on AMD GPUs. They came back with "working as intended" and found no perceivable bug. Interestingly, if we change the values in ks, ns, or bs, the test passes on ROCm. These particular sizes in this particular order generate a single problematic input that fails the tolerance check by ~0.07. Again, this is not a bug, just a difference in implementation. This PR loosens the tolerance for ROCm only.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104425
Approved by: https://github.com/jeffdaily, https://github.com/nikitaved, https://github.com/lezcano
Summary:
Ensures that creating tensors, copying, filling with zeroes, checking for nan works on cuda for the `float8` dtypes. This should be enough for float8 emulation on cuda.
Note that I skipped the mul test - it's less trivial to add (need a new c++ macro), and there is no use case for it. We can follow up on that in the future.
Test Plan:
```
python test/test_quantization.py TestFloat8Dtype
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105807
Approved by: https://github.com/ezyang, https://github.com/jerryzh168, https://github.com/albanD
Introduce `Modularize` pass that analyzes the flat `fx.GraphModule` and creates nested
layers of sub `fx.GraphModule`s along with the `call_module` fx nodes that invokes them.
The analysis is done on the meta data "nn_module_stack", which captures the `nn.Module`
each flat `fx.Node` belongs to.
`FxOnnxInterpreter` is updated to support `call_module`. The related sub module linked
by `node.target` is exported as an ONNX model local function. The `call_module` node itself
is exported as an ONNX node, associated with the ONNX model local function by op_type.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105618
Approved by: https://github.com/justinchuby
Previously, if 'starts', 'ends', or 'steps' was dynamic, then shape inference would give up, even for dimensions which are not being sliced.
This commit improves this by setting the output shape to be the same as the input shape for dimensions which are not being sliced. Add a new test to cover this case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105755
Approved by: https://github.com/thiagocrepaldi, https://github.com/BowenBao
Currently when dynamic=True, TritonTemplates won't be used, as the condition `if list(call_args) != expected_args` defined in `TritonTemplate` cannot be satisfied. This PR tries to fix this issue by allowing passing symbolic variable names via `extra_args` and replacing all symbolic values in the generated TritonTemplate code as call_arg names.
With this change, a locally compiled mm + epilogue node calls into the Triton kernel successfully.
This PR also introduces a new config "max_autotune_gemm_backends" to allow specifying candidate gemm backends for max autotune. Current choices: combinations of ATEN, TRITON. This makes tests easier, so that we can explicitly test Triton gemm kernels + epilogue fusions + dynamic shapes, without falling back to ATen ops.
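A minimal sketch of exercising the new config (assuming a CUDA device; shapes are arbitrary):
```python
import torch
import torch._inductor.config as inductor_config

# Restrict max-autotune GEMM candidates to Triton templates so the epilogue-fused
# Triton kernel is exercised instead of falling back to ATen.
inductor_config.max_autotune_gemm_backends = "TRITON"

def mm_relu(a, b):
    return torch.relu(a @ b)

compiled = torch.compile(mm_relu, mode="max-autotune", dynamic=True)
out = compiled(torch.randn(64, 128, device="cuda"), torch.randn(128, 32, device="cuda"))
```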
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105295
Approved by: https://github.com/jansel
- This PR rewords the `BackwardPrefetch` docs to make the tradeoffs clear in the first sentence of each with more technical details after.
- The only supported `_FSDPPolicy` is `ModuleWrapPolicy` at the time of writing this PR. We may add others in the future such as in my other PR stack. This PR removes `_FSDPPolicy` from the public docs.
- This provides some more details around `MixedPrecision` such as explaining that layer norm and batch norm accumulate in fp32.
Follow-ups:
- Why do we force batch norm modules to have FSDP applied separately? (E.g. was this because before batch norm kernels did not support fp16/bf16?) Like layer norm, this just means that the affine parameters are in fp32. Both already accumulate in fp32 even with fp16/bf16 inputs.
- Check the `param_init_fn` + `sync_module_states=True` usage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105847
Approved by: https://github.com/rohan-varma
Fixes#105467, namely the need of setting `aten_graph=True` in `_dynamo.export`
to make fake mode onnx exporter work.
Previously, `make_fx` called by passes always create new fake mode. Hence it is
missing out info from `shape_env` recorded during dynamo export. This PR tries
to check and fetch existing fake mode from graph node meta.
Also enable python dispatcher context when calling `make_fx`. This is done in
`_dynamo.export(aten_graph=True)` but was missing in our passes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105764
Approved by: https://github.com/titaiwangms
TL;DR: triton_utils.config_of determines divisibility by 16 for each of the inputs to the kernel (pointer alignment for pointers, and divisibility by 16 for sizes). For sizes, the check previously could only return true if the expr representing the size was an integer. However, it's possible for non-integral exprs to be divisible by 16, e.g. for an expr like 16*s0.
Motivation: Knowledge about divisibility by 16 allows for vectorizing loads and stores, which can improve memory bandwidth. If we have, for example, kernels with shape [s0, 16] (dynamic batch size; static, divisible-by-16 other dimensions), we want to still be able to vectorize those loads and stores.
Dashboard results suggest that this improves dynamic shape training performance for timm, and possibly a small improvement for torchbench as well. More details are provided in a comment below.
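A simplified stand-in for the divisibility check (not Inductor's actual code), showing why an expression like 16*s0 can be proven divisible by 16 even though it is not a constant:
```python
import sympy

def statically_divisible_by_16(expr: sympy.Expr) -> bool:
    # Only report divisibility when it holds for every value of the free symbols.
    return sympy.gcd(expr, 16) == 16

s0 = sympy.Symbol("s0", positive=True, integer=True)
print(statically_divisible_by_16(sympy.Integer(128)))  # True
print(statically_divisible_by_16(16 * s0))             # True: holds for any s0
print(statically_divisible_by_16(8 * s0))              # False: depends on s0
```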
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105743
Approved by: https://github.com/ezyang, https://github.com/aakhundov
Summary: ExirExportedProgram would like to have this feature. Today it does it itself since it inherits from ExportedProgram, but since we are moving it to composition, I think it would be cleaner to upstream the behavior into the root object anyway.
Test Plan: ci, but todo where are the tests for this file?
Differential Revision: D47645843
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105852
Approved by: https://github.com/tugsbayasgalan
This branch:
1) converts the autograd tape into an FX graph
2) caches that conversion using a "shadow" graph
3) compiles and runs the generated FX graph instead of the normal autograd
What works currently:
1) Caching, capture, and initial integration
2) Backwards hooks
3) Inlining AotAutograd generated subgraphs
4) torch.compiling the generated FX graph
5) Auto-detecting dynamic shapes based on changes
Future work
1) Larger scale testing
2) Boxed calling convention, so memory can be freed incrementally
3) Support hooks on SavedTensor
4) Additional testing by running eager autograd tests under compiled_autograd.enable()
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103822
Approved by: https://github.com/ezyang, https://github.com/albanD
Fixes #102678, fixes #102629, fixes #102558
HipSOLVER performance on ROCm 5.4.2 and later no longer serves as a massive bottleneck. Additionally, using magma on ROCm in this case caused test_compare_cpu_lialg_pinv_singular_cuda_float32 to fail. Using hipSOLVER, the test now passes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103540
Approved by: https://github.com/lezcano
Summary: This is a follow-up on https://github.com/pytorch/pytorch/pull/105496. There are several issues with the previous fix,
1) It explicitly does copy for every output at the end of the main function;
2) When an output is ReinterpretView, no as_strided was generated for it;
3) There can be duplicated buffer declarations.
This PR fixes the issue by making sure can_reuse behaves consistently between the two AOTInductor passes, and thus always generates the same set of kernels. It also adds handling of ReinterpretView.
Differential Revision: [D47692214](https://our.internmc.facebook.com/intern/diff/D47692214)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105773
Approved by: https://github.com/jansel
Summary:
A small model (<100MB) took about 20mins to load, and consume 16GB memory.
Strobelight profiling: https://fburl.com/strobelight/abwtz0ry
We realized that calc_line_start_offsets is the culprit: line_starting_offsets_ is a vector of line offsets.
There are >20000 places where we generate such an ErrorReport, and the file has ~100000 lines.
So the total memory cost is about 100000 x 20000 x 8 bytes = ~16GB.
We propose to skip the error info for extreme large source file (>1MB). And keep an environment variable to keep the ability to print the source code info for large source file.
Test Plan:
buck run mode/opt-split-dwarf scripts/lufang:load_pt_model -- --model_file_path=/data/local/models/961746678/2/961746678_2.predictor.disagg.gpu.local
before the change, it takes 20mins to load, and the model costs 16GB memory (the model itself is only <100MB)
after the change, it takes 15s to load.
The most of the time / space is spent on calc_line_start_offsets, https://fburl.com/code/2to60zqu
Differential Revision: D47610805
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105608
Approved by: https://github.com/hl475
I've added the implementation of erfinv using the algorithm from 4154c8ea15/aten/src/ATen/native/Math.h (L152) in order for the MPS based algorithm to match the CPU automatic test. This PR is using the new metal api calls from https://github.com/pytorch/pytorch/pull/100661
Testing shows MPS has a decent speedup (~270x) compared to CPU on a tensor of 200 million elements.
```
import torch
x = torch.arange(-1, 1, 1e-8) # default cpu tensor
#measure CPU compute time by calling torch.erfinv
time = %timeit -o -q -r 5 torch.erfinv(x)
cpu_time = time.average
print("CPU torch.erfinv time: ", cpu_time)
x = x.to("mps")
# measure MPS compute time
time = %timeit -o -q -r 5 torch.erfinv(x)
mps_time = time.average
print("MPS torch.erfinv time: ", mps_time)
print(f"MPS torch.erfinv is {cpu_time/mps_time*100} percent faster than CPU torch.erfinv")
# compute MSE between MPS and CPU torch.erfinv
x = x.to("cpu")
y_cpu = torch.erfinv(x)
x = x.to("mps")
y_mps = torch.erfinv(x)
y_mps = y_mps.to("cpu")
mask = torch.isfinite(y_cpu) & torch.isfinite(y_mps.to("cpu"))
y_mps = y_mps[mask]
y_cpu = y_cpu[mask]
x = x[mask]
print(f"length of y_mps: {len(y_mps)}, length of y_cpu: {len(y_cpu)}, length of x: {len(x)}")
mse = torch.square(y_cpu - y_mps).mean()
print("MSE between MPS and CPU torch.erfinv: ", mse)
diff = torch.abs(y_cpu - y_mps)
print("Largest difference")
print(f"x: {x[torch.argmax(diff)]}, y_cpu: {y_cpu[torch.argmax(diff)]}, y_mps: {y_mps[torch.argmax(diff)]} , diff = {y_cpu[torch.argmax(diff)] - y_mps[torch.argmax(diff)]}")
```
CPU torch.erfinv time: 2.654937833400254
MPS torch.erfinv time: 0.009831255332002912
MPS torch.erfinv is 27005.07456822776 percent faster than CPU torch.erfinv
length of y_mps: 199999992, length of y_cpu: 199999992, length of x: 199999992
MSE between MPS and CPU torch.erfinv: tensor(4.2339e-14)
Largest difference
x: -0.9999980330467224, y_cpu: -3.363569736480713, y_mps: -3.3635685443878174 , diff = -1.1920928955078125e-06
Fixes https://github.com/pytorch/pytorch/issues/86808
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101507
Approved by: https://github.com/kulinseth
Previously during torch.export(), when an exception is raised during tracing, Dynamo displays this error:
“You can suppress this exception and fall back to eager by setting: import torch._dynamo torch._dynamo.config.suppress_errors = True”
This is not viable in torch.export(), thus this diff suppresses this suggestion during export.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105767
Approved by: https://github.com/anijain2305
Currently, exporting a model to ONNX with fake tensor mode requires the
user to load data and model within `torch.onnx.enable_fake_mode` context,
but the actual call to `torch.onnx.dynamo_export` is done outside such
context.
With this PR, we enable `torch.onnx.dynamo_export` to be called either
within `torch.onnx.enable_fake_mode` or outside of it. This feature
required changes to the core PyTorch Dynamo, which were greatly
supported by @ezyang
In future steps we will determine which scenario we are going to
support, but for now we can use either to explore different options and
scenarios and assess their pros and cons.
This PR also creates a separate suite of tests for fake mode specific
scenarios (`TestFxToOnnxFakeTensorWithOnnxRuntime`).
It was done separately to decrease the test time, but we
could merge it with the default `TestFxToOnnxWithOnnxRuntime`. The
additional parameters are `load_checkpoint_during_init` and
`export_within_fake_mode`
With the newly added supported of nested export within fake mode, the
following scenarios are now supported:
```python
import torch

with torch.onnx.enable_fake_mode() as fake_context:
    fake_args = create_args()
    fake_kwargs = create_kwargs()
    fake_model = create_model()
    fake_model.load_state_dict(torch.load(tmp_checkpoint_file.name))
    export_options = torch.onnx.ExportOptions(fake_context=fake_context)
    # `torch.onnx.dynamo_export` called WITHIN `torch.onnx.enable_fake_mode`
    export_output = torch.onnx.dynamo_export(
        fake_model,
        *fake_args,
        **fake_kwargs,
        export_options=export_options,
    )
export_output.save("/path/to/model.onnx", model_state_dict=create_model())
```
If we decide to only support scenarios in which `torch._dynamo.export` is called within `FakeTensorMode`, then we can remove `fake_mode` argument from `torch._dynamo.export` as a follow-up task
ps: This PR is mostly Edward's https://github.com/pytorch/pytorch/pull/105468 + unit tests after an offline discussion
ps: https://github.com/pytorch/pytorch/issues/105464 tracks pending tasks/limitations from this PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105477
Approved by: https://github.com/ezyang, https://github.com/BowenBao
Summary:
To save on binary size, some of the mobile configs don't include the
autograd kernels for built-in operators (VariableTypeEverything.cpp).
For the mobile build:
- we don't care about having a nice autograd fallback that warns if
an operator has incorrect autograd support. If you're running
a custom operator on mobile then it's already too late for us to warn
or error on it.
- for perf reasons, we do not want mobile to go through the autograd fallback
for all operators (the boxing/unboxing adds overhead).
As a result, on mobile we set the fallback to the fallthrough.
Test Plan: existing tests and benchmarks
Differential Revision: D47674272
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105750
Approved by: https://github.com/soulitzer
The feature was never fully finished and never got any adoption but
TCPStore pays the cost of twice the number of tcp connections anyway.
While the cost of all those idle connections is minimal, it doesn't come for free:
- It increases the likelihood of a connection-refused failure during the initialization stampede.
- TCPStore uses poll for checking for socket availability which scales linearly on the number of sockets regardless of their status.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105014
Approved by: https://github.com/fduwjj
Prior to this PR, if the user called `fake_model.load_state_dict()` from within `enable_fake_mode`, the initial model state dict (including non persistent buffers) would not be reused by `ExportOutput.save` during ONNX proto creation.
That is not necessarily a bug because `ExportOutput.save` has a `model_state_dict` in which they can specify any state they want. However, it can be a hassle because if the user doesn't provide a full state, including non-persistent buffers, the resulting ONNX graph would require the missing buffers to be specified as input during execution.
With this PR, the `enable_fake_mode` is improved to capture the initial model state including any non-persistent buffer. This reference (not actual data) is persisted within `ExportOutput` and used by `save` to load additional `state_dict` that was captured by `enable_fake_mode`. The result is an ONNX graph with all model state without user having to specify the non-persistent buffers.
This helps address https://github.com/pytorch/pytorch/issues/105233 for models that call `fake_model.load_state_dict` under the hood, as potential buffers not returned by `model.state_dict()` may be captured.
ps: https://github.com/pytorch/pytorch/issues/105464 tracks pending tasks/limitations from this PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105247
Approved by: https://github.com/BowenBao
Summary:
We are working toward full model compilation, where when compilation error happens, we just fall back to eager mode rather than error out.
But at the same time, we should fix these issues if they are bugs. We will:
* 1/ log warnings in OSS;
* 2/ log warnings and write them into Scuba in fbcode;
to prevent us from ignoring these issues.
Test Plan: Manual test
Differential Revision: D47506314
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105307
Approved by: https://github.com/jansel
When the hook registered by Tensor::register_hook (in C++) gets passed
an undefined tensor, it raises an internal assert in debug mode.
The cause is that we attempt to construct an OptionalTensorRef
(4448c78a5d/aten/src/ATen/core/Tensor.h (L68))
which asserts that the passed-in TensorBase is defined.
The fix is that we create a new TensorRef class to convert the
TensorBase into a Tensor without bumping the refcount (which is what
OptionalTensorRef does). We cannot reuse OptionalTensorRef because
OptionalTensorRef represents `optional<Tensor>` that cannot hold an
Undefined Tensor.
For some more historical context, it looks like this behavior was introduced
in #63612
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105587
Approved by: https://github.com/soulitzer
#98035 adds some additional logic to `wait_for_process` that includes catching a timeout exception and sending `SIGINT` to the process before waiting on it again with a timeout. However, if the additional wait times out again, then the wait call in the `finally` block (which does not have a timeout) has the potential to hang indefinitely.
This PR kills the process if a second timeout exception occurs after the `SIGINT` signal is sent.
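A simplified sketch of the resulting shutdown order (not the actual helper in the test harness): polite SIGINT first, hard kill after a second timeout, so the final wait cannot hang.
```python
import signal
import subprocess

def wait_for_process(proc: subprocess.Popen, timeout: float = 30.0) -> int:
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.send_signal(signal.SIGINT)  # give the process a chance to clean up
        try:
            return proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            proc.kill()                  # second timeout: stop waiting politely
            return proc.wait()
```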
CC @clee2000 @ptrblck @xwang233 @kwen2501
Also hoping that this has the potential to reduce turnaround time for distributed timeouts like those seen in https://hud.pytorch.org/pr/pytorch/pytorch/105274#15148799113
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105625
Approved by: https://github.com/ezyang
The guard functions require you to ALREADY KNOW that a particular
condition holds. If you don't know (you want to guard on an expression
being a particular value, and then get access to that value), use
the evaluate functions.
I renamed the functions that don't abide by this:
```
guard_min -> evaluate_min
guard_max (deleted, no uses)
guard_static_shape -> evaluate_static_shape
guard_static_shapes -> evaluate_static_shapes
```
Some added comments.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105585
Approved by: https://github.com/voznesenskym
Docs builds were not exiting with failure, for example https://github.com/pytorch/pytorch/actions/runs/5604612586/job/15184094038#step:9:1131, because the if statement returned 0 even when we wanted to exit with failure.
Also get rid of the circleci scripts since they aren't used anywhere.
Example error:
```
copying static files... done
copying extra files... done
dumping search index in English (code: en)... done
dumping object inventory... done
build finished with problems, 1 warning.
make: *** [Makefile:49: html] Error 1
+ code=2
+ '[' 2 -ne 0 ']'
+ set +x
=========================
/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/nn/parallel/comm.py:docstring of torch.nn.parallel.comm.scatter:1: WARNING: more than one target found for cross-reference 'Stream': torch.cuda.Stream, torch.cuda.streams.Stream, torch.cpu.Stream
=========================
Docs build failed. If the failure is not clear, scan back in the log
for any WARNINGS or for the line build finished with problems
(tried to echo the WARNINGS above the ==== line)
=========================
+ return 2
+ exit 0
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105678
Approved by: https://github.com/seemethere
Summary:
Fix existing CAFFE static tracepoint macros and make them match the latest FOLLY version.
Per anakryiko, the current `CAFFE_SDT` definition is broken. Quote:
```
"Arguments: -5@-16(%rbp) -4@$100
Arguments: -8@-16(%rbp) -4@$100
#define FOLLY_SDT_IS_ARRAY_POINTER(x) ((__builtin_classify_type(x) == 14) || \
(__builtin_classify_type(x) == 5))
vs
#define CAFFE_SDT_ISARRAY(x) (__builtin_classify_type(x) == 14)
https://github.com/atgreen/gcc/blob/master/gcc/typeclass.h
that 5 is "pointer_type_class"
so you were right, it's just fixed up version of header
I think it should be 8, not 5
5 is the size of literal, but you don't pass string literal as an argument, you pass its address, so actual argument is a pointer, and so 8 byte long
you can try just fixing up CAFFE_SDT macro
```
{F1048035373}
Test Plan:
Tested the following macros on test scripts with libbpf USDTs:
CAFFE_SDT
CAFFE_DISABLE_SDT
CAFFE_SDT_WITH_SEMAPHORE
Reviewed By: RihamSelim
Differential Revision: D47159249
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105232
Approved by: https://github.com/chaekit, https://github.com/malfet
The test at
f508d3564c/test/test_cuda_multigpu.py (L1282-L1290)
can fail because the Torch CUDA caching allocator may cache the allocation and cause the "new_alloc" to be the same as the "old_alloc".
```python
self.assertGreater(memory_allocated(0), current_alloc[0])
```
I suggest that we use `assertGreaterEqual` instead of `assertGreater` in the test.
Individually running only this test does not make it fail but running it together with other tests from the same test module will make it fail.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105501
Approved by: https://github.com/zou3519
Solving #105242.
During export, the exported function's signature changes multiple times. Suppose we'd like to export f as shown in following example:
```python
def f(arg1, arg2, kw1, kw2):
    pass
args = (arg1, arg2)
kwargs = {"kw2":arg3, "kw1":arg4}
torch.export(f, args, kwargs)
```
The signature changes multiple times during the export process, in the following order:
1. **gm_torch_level = dynamo.export(f, *args, \*\*kwargs)**. In this step, we turn all kinds of parameters such as **positional_only**, **var_positional**, **kw_only**, and **var_kwargs** into **positional_or_kw**. It also preserves the positional and keyword argument names of the original function (i.e. f in this example) [here](https://github.com/pytorch/pytorch/blob/main/torch/_dynamo/export.py#L546C13-L546C27). The order of kwargs will be the **key order** of kwargs (after Python 3.6, this is the insertion order of keys) rather than the original function signature, and the order is baked into the _orig_args variable of gm_torch_level's pytree info. So we'll have:
```python
def gm_torch_level(arg1, arg2, kw2, kw1)
```
Such difference is acceptable as it's transparent to users of export.
2. **gm_aot_export = aot_export_module(gm_torch_level, pos_or_kw_args)**. In this step, we need to turn kwargs into positional args in the order of how gm_torch_level expected, which is stored in _orig_args. The returned gm_aot_export has the graph signature of flat_args, in_spec = pytree.tree_flatten(pos_or_kw_args):
``` python
flat_args, _ = pytree.tree_flatten(pos_or_kw_args)
def gm_aot_export(*flat_args)
```
3. **exported_program(*args, \*\*kwargs)**. The exported artifact is exported_program, which is a wrapper over gm_aot_export and has the same calling convention as the original function "f". To do this, we need to 1. specialize the order of kwargs into pos_or_kw_args and 2. flatten the pos_or_kw_args into what gm_aot_export expects. We can combine the two steps into one with:
```python
_, in_spec = pytree.tree_flatten((args, kwargs))
# Then during exported_program.__call__(*args, **kwargs)
flat_args = fx_pytree.tree_flatten_spec((args, kwargs), in_spec)
```
, where kwargs is treated as a normal pytree whose key order is preserved in in_spec.
Implementation-wise, we treat _orig_args in the dynamo-exported graph module as the single source of truth, and kwargs are ordered following it.
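A minimal sketch of the ordering behavior relied on here (assuming the Python pytree implementation, which flattens dicts in insertion order):
```python
import torch.utils._pytree as pytree

args = (1, 2)
kwargs = {"kw2": 3, "kw1": 4}

# kwargs are flattened in insertion order (kw2 before kw1), and that order is
# recorded in in_spec so later calls can be flattened the same way.
flat_args, in_spec = pytree.tree_flatten((args, kwargs))
print(flat_args)  # [1, 2, 3, 4]
```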
Test plan:
See added tests in test_export.py.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105337
Approved by: https://github.com/angelayi, https://github.com/tugsbayasgalan
Since Python 3.11 bytecode contains endline and column information, for each bytecode, we attribute the source code corresponding to the bytecode in a more accurate way. For example, we can highlight a function call in a series of nested function calls, or highlight a function call spanning multiple lines.
Sample:
```python
import torch
import torch._dynamo
from functorch.experimental.control_flow import cond
def h(x):
    return x * 5

def true_fn(x):
    return x * 2

def false_fn(x):
    return x * 3

def f(pred, x):
    x = h(
        h(h(x))
    )
    x = x[1:][:2]
    torch._dynamo.graph_break()
    x = cond(pred, true_fn, false_fn, [x])

opt_f = torch.compile(f, backend="eager")
opt_f(torch.tensor(True), torch.randn(3, 3, 3, 3))
```
Output:
```
$ TORCH_LOGS="trace_call" python playground9.py
TRACE inlined call h from f /scratch/williamwen/work/pytorch/playground9.py:16
h(h(x))
~^^^
TRACE FX call mul from h /scratch/williamwen/work/pytorch/playground9.py:6 (inline depth: 1)
return x * 5
~~^~~
TRACE inlined call h from f /scratch/williamwen/work/pytorch/playground9.py:16
h(h(x))
~^^^^^^
TRACE FX call mul_1 from h /scratch/williamwen/work/pytorch/playground9.py:6 (inline depth: 1)
return x * 5
~~^~~
TRACE inlined call h from f /scratch/williamwen/work/pytorch/playground9.py:15
x = h(
~^
h(h(x))
^^^^^^^
)
^
TRACE FX call mul_2 from h /scratch/williamwen/work/pytorch/playground9.py:6 (inline depth: 1)
return x * 5
~~^~~
TRACE FX call getitem from f /scratch/williamwen/work/pytorch/playground9.py:18
x = x[1:][:2]
~^^^^
TRACE FX call getitem_1 from f /scratch/williamwen/work/pytorch/playground9.py:18
x = x[1:][:2]
~~~~~^^^^
TRACE inlined call true_fn from <resume in f> /scratch/williamwen/work/pytorch/playground9.py:20
x = cond(pred, true_fn, false_fn, [x])
~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TRACE FX call mul from true_fn /scratch/williamwen/work/pytorch/playground9.py:9 (inline depth: 1)
return x * 2
~~^~~
TRACE inlined call false_fn from <resume in f> /scratch/williamwen/work/pytorch/playground9.py:20
x = cond(pred, true_fn, false_fn, [x])
~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TRACE FX call mul from false_fn /scratch/williamwen/work/pytorch/playground9.py:12 (inline depth: 1)
return x * 3
~~^~~
TRACE FX call cond from <resume in f> /scratch/williamwen/work/pytorch/playground9.py:20
x = cond(pred, true_fn, false_fn, [x])
~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104676
Approved by: https://github.com/ezyang
Hi! we've been fuzzing PyTorch project with [sydr-fuzz](https://github.com/ispras/oss-sydr-fuzz/tree/master/projects/pytorch).
We've found a couple of heap-buffer-overflows in the `distributed/rpc` module.
PyTorch version: 0f1621df1a
OS: Ubuntu 20.04
### How to reproduce
1. Build docker from this [Dockerfile](https://github.com/ispras/oss-sydr-fuzz/tree/master/projects/pytorch) and run the container.
2. Then run `message_deserialize-afl++` fuzzing target on provided crash-inputs ([crash-056826339f6da8dbb97c944178e94494369a9e22.zip](https://github.com/pytorch/pytorch/files/12096151/crash-056826339f6da8dbb97c944178e94494369a9e22.zip), [crash-4f85db9f19fe152c0018f6675c3b4c122227058f.zip](https://github.com/pytorch/pytorch/files/12096160/crash-4f85db9f19fe152c0018f6675c3b4c122227058f.zip)):
```
unzip crash-4f85db9f19fe152c0018f6675c3b4c122227058f.zip
/message_deserialize-afl++ crash-4f85db9f19fe152c0018f6675c3b4c122227058f
```
### Heap buffer overflow in torch/csrc/jit/serialization/pickle.cpp:144
[crash-056826339f6da8dbb97c944178e94494369a9e22.zip](https://github.com/pytorch/pytorch/files/12096151/crash-056826339f6da8dbb97c944178e94494369a9e22.zip)
```asan
"==7614==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x60b001b58355 at pc 0x0000005d1147 bp 0x7fffffffa610 sp 0x7fffffff9de0",
"READ of size 256 at 0x60b001b58355 thread T0",
" #0 0x5d1146 in __asan_memcpy /llvm-project-llvmorg-14.0.6/compiler-rt/lib/asan/asan_interceptors_memintrinsics.cpp:22:3",
" #1 0xd1cd19f in torch::jit::unpickle(char const*, unsigned long, std::function<c10::StrongTypePtr (c10::QualifiedName const&)>, c10::ArrayRef<at::Tensor>, c10::Type::SingletonOrSharedTypePtr<c10::Type> (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&))::$_3::operator()(char*, unsigned long) const /pytorch/torch/csrc/jit/serialization/pickle.cpp:144:9",
" #2 0xd1cd19f in unsigned long std::__invoke_impl<unsigned long, torch::jit::unpickle(char const*, unsigned long, std::function<c10::StrongTypePtr (c10::QualifiedName const&)>, c10::ArrayRef<at::Tensor>, c10::Type::SingletonOrSharedTypePtr<c10::Type> (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&))::$_3&, char*, unsigned long>(std::__invoke_other, torch::jit::unpickle(char const*, unsigned long, std::function<c10::StrongTypePtr (c10::QualifiedName const&)>, c10::ArrayRef<at::Tensor>, c10::Type::SingletonOrSharedTypePtr<c10::Type> (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&))::$_3&, char*&&, unsigned long&&) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/invoke.h:60:14",
" #3 0xd27aa48 in std::function<unsigned long (char*, unsigned long)>::operator()(char*, unsigned long) const /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/std_function.h:622:14",
" #4 0xd27a61c in torch::jit::Unpickler::readSlowWithBuffer(char*, unsigned long) /pytorch/torch/csrc/jit/serialization/unpickler.cpp:1047:23",
" #5 0xd2698b8 in unsigned char torch::jit::Unpickler::read<unsigned char>() /pytorch/torch/csrc/jit/serialization/unpickler.h:111:7",
" #6 0xd268816 in torch::jit::Unpickler::readOpCode() /pytorch/torch/csrc/jit/serialization/unpickler.h:130:38",
" #7 0xd268816 in torch::jit::Unpickler::run() /pytorch/torch/csrc/jit/serialization/unpickler.cpp:238:17",
" #8 0xd268522 in torch::jit::Unpickler::parse_ivalue() /pytorch/torch/csrc/jit/serialization/unpickler.cpp:204:3",
" #9 0xd1c8502 in torch::jit::unpickle(std::function<unsigned long (char*, unsigned long)>, std::function<c10::StrongTypePtr (c10::QualifiedName const&)>, c10::ArrayRef<at::Tensor>, c10::Type::SingletonOrSharedTypePtr<c10::Type> (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)) /pytorch/torch/csrc/jit/serialization/pickle.cpp:126:20",
" #10 0xd1c8dbd in torch::jit::unpickle(char const*, unsigned long, std::function<c10::StrongTypePtr (c10::QualifiedName const&)>, c10::ArrayRef<at::Tensor>, c10::Type::SingletonOrSharedTypePtr<c10::Type> (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)) /pytorch/torch/csrc/jit/serialization/pickle.cpp:136:10",
" #11 0xe56b16d in torch::distributed::rpc::readWrappedPayload(std::vector<char, std::allocator<char> >&, torch::distributed::rpc::Message const&) /pytorch/torch/csrc/distributed/rpc/utils.cpp:515:18",
" #12 0xe3d8f29 in torch::distributed::autograd::RpcWithProfilingReq::fromMessage(torch::distributed::rpc::Message const&) /pytorch/torch/csrc/distributed/autograd/rpc_messages/rpc_with_profiling_req.cpp:112:24",
" #13 0xe55f692 in torch::distributed::rpc::deserializeRequest(torch::distributed::rpc::Message const&) /pytorch/torch/csrc/distributed/rpc/utils.cpp:138:14",
" #14 0x6120a8 in LLVMFuzzerTestOneInput /message_deserialize.cc:192:27",
" #15 0x535de1 in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:611:15",
" #16 0x51fcec in fuzzer::RunOneTest(fuzzer::Fuzzer*, char const*, unsigned long) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:324:6",
" #17 0x525a3b in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:860:9",
" #18 0x54eff2 in main /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerMain.cpp:20:10",
" #19 0x7ffff7a37082 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24082) (BuildId: 1878e6b475720c7c51969e69ab2d276fae6d1dee)",
" #20 0x51a60d in _start (/message_deserialize_fuzz+0x51a60d)",
"",
"0x60b001b58355 is located 0 bytes to the right of 101-byte region [0x60b001b582f0,0x60b001b58355)",
"allocated by thread T0 here:",
" #0 0x60c7bd in operator new(unsigned long) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/asan/asan_new_delete.cpp:95:3",
" #1 0x62c7fd in std::_Vector_base<char, std::allocator<char> >::_M_allocate(unsigned long) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/stl_vector.h:346:20",
" #2 0x62c7fd in void std::vector<char, std::allocator<char> >::_M_range_initialize<unsigned char const*>(unsigned char const*, unsigned char const*, std::forward_iterator_tag) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/stl_vector.h:1582:14",
" #3 0x612913 in std::vector<char, std::allocator<char> >::vector<unsigned char const*, void>(unsigned char const*, unsigned char const*, std::allocator<char> const&) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/stl_vector.h:657:4",
" #4 0x611c4a in LLVMFuzzerTestOneInput /message_deserialize.cc:181:21",
" #5 0x535de1 in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:611:15",
" #6 0x51fcec in fuzzer::RunOneTest(fuzzer::Fuzzer*, char const*, unsigned long) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:324:6",
" #7 0x525a3b in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:860:9",
" #8 0x54eff2 in main /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerMain.cpp:20:10",
" #9 0x7ffff7a37082 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24082) (BuildId: 1878e6b475720c7c51969e69ab2d276fae6d1dee)",
"",
"SUMMARY: AddressSanitizer: heap-buffer-overflow /llvm-project-llvmorg-14.0.6/compiler-rt/lib/asan/asan_interceptors_memintrinsics.cpp:22:3 in __asan_memcpy",
"Shadow bytes around the buggy address:",
" 0x0c1680363010: 00 00 00 fa fa fa fa fa fa fa fa fa 00 00 00 00",
" 0x0c1680363020: 00 00 00 00 00 00 00 00 00 00 fa fa fa fa fa fa",
" 0x0c1680363030: fa fa 00 00 00 00 00 00 00 00 00 00 00 00 00 fa",
" 0x0c1680363040: fa fa fa fa fa fa fa fa 00 00 00 00 00 00 00 00",
" 0x0c1680363050: 00 00 00 00 00 fa fa fa fa fa fa fa fa fa 00 00",
"=>0x0c1680363060: 00 00 00 00 00 00 00 00 00 00[05]fa fa fa fa fa",
" 0x0c1680363070: fa fa fa fa 00 00 00 00 00 00 00 00 00 00 00 00",
" 0x0c1680363080: 05 fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa",
" 0x0c1680363090: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa",
" 0x0c16803630a0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa",
" 0x0c16803630b0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa",
"Shadow byte legend (one shadow byte represents 8 application bytes):",
" Addressable: 00",
" Partially addressable: 01 02 03 04 05 06 07",
" Heap left redzone: fa",
" Freed heap region: fd",
" Stack left redzone: f1",
" Stack mid redzone: f2",
" Stack right redzone: f3",
" Stack after return: f5",
" Stack use after scope: f8",
" Global redzone: f9",
" Global init order: f6",
" Poisoned by user: f7",
" Container overflow: fc",
" Array cookie: ac",
" Intra object redzone: bb",
" ASan internal: fe",
" Left alloca redzone: ca",
" Right alloca redzone: cb",
"==7614==ABORTING"
```
### Heap-buffer-overflow in aten/src/ATen/core/ivalue.h:432
[crash-4f85db9f19fe152c0018f6675c3b4c122227058f.zip](https://github.com/pytorch/pytorch/files/11553011/crash-4f85db9f19fe152c0018f6675c3b4c122227058f.zip)
```asan
"==60983==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x6150001e4108 at pc 0x000000601877 bp 0x7fffffff9fd0 sp 0x7fffffff9fc8",
"READ of size 4 at 0x6150001e4108 thread T0",
" #0 0x601876 in c10::IValue::isTensor() const /pytorch/aten/src/ATen/core/ivalue.h:432:27",
" #1 0x601876 in c10::IValue::destroy() /pytorch/aten/src/ATen/core/ivalue.h:1148:9",
" #2 0x699f72 in c10::IValue::~IValue() /pytorch/aten/src/ATen/core/ivalue.h:236:5",
" #3 0x699f72 in void std::_Destroy<c10::IValue>(c10::IValue*) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/stl_construct.h:140:19",
" #4 0x699f72 in void std::_Destroy_aux<false>::__destroy<c10::IValue*>(c10::IValue*, c10::IValue*) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/stl_construct.h:152:6",
" #5 0x699f72 in void std::_Destroy<c10::IValue*>(c10::IValue*, c10::IValue*) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/stl_construct.h:184:7",
" #6 0x699f72 in void std::_Destroy<c10::IValue*, c10::IValue>(c10::IValue*, c10::IValue*, std::allocator<c10::IValue>&) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/alloc_traits.h:738:7",
" #7 0x699f72 in std::vector<c10::IValue, std::allocator<c10::IValue> >::_M_erase_at_end(c10::IValue*) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/stl_vector.h:1796:6",
" #8 0x699e4a in std::vector<c10::IValue, std::allocator<c10::IValue> >::_M_erase(__gnu_cxx::__normal_iterator<c10::IValue*, std::vector<c10::IValue, std::allocator<c10::IValue> > >, __gnu_cxx::__normal_iterator<c10::IValue*, std::vector<c10::IValue, std::allocator<c10::IValue> > >) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/vector.tcc:191:4",
" #9 0xea5b11e in torch::jit::Unpickler::readInstruction() /pytorch/torch/csrc/jit/serialization/unpickler.cpp:454:14",
" #10 0xea57d97 in torch::jit::Unpickler::run() /pytorch/torch/csrc/jit/serialization/unpickler.cpp:251:27",
" #11 0xea579f1 in torch::jit::Unpickler::parse_ivalue() /pytorch/torch/csrc/jit/serialization/unpickler.cpp:204:3",
" #12 0xe9a435e in torch::jit::unpickle(std::function<unsigned long (char*, unsigned long)>, std::function<c10::StrongTypePtr (c10::QualifiedName const&)>, c10::ArrayRef<at::Tensor>, c10::Type::SingletonOrSharedTypePtr<c10::Type> (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)) /pytorch/torch/csrc/jit/serialization/pickle.cpp:126:20",
" #13 0xe9a471c in torch::jit::unpickle(char const*, unsigned long, std::function<c10::StrongTypePtr (c10::QualifiedName const&)>, c10::ArrayRef<at::Tensor>, c10::Type::SingletonOrSharedTypePtr<c10::Type> (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)) /pytorch/torch/csrc/jit/serialization/pickle.cpp:136:10",
" #14 0xfcd034b in torch::distributed::autograd::PropagateGradientsReq::fromMessage(torch::distributed::rpc::Message const&) /pytorch/torch/csrc/distributed/autograd/rpc_messages/propagate_gradients_req.cpp:54:18",
" #15 0xfe720ff in torch::distributed::rpc::deserializeRequest(torch::distributed::rpc::Message const&) /pytorch/torch/csrc/distributed/rpc/utils.cpp:132:14",
" #16 0x5c5c93 in LLVMFuzzerTestOneInput /message_deserialize.cc:192:27",
" #17 0x5c2bfd in ExecuteFilesOnyByOne /AFLplusplus/utils/aflpp_driver/aflpp_driver.c:255:7",
" #18 0x5c2a08 in LLVMFuzzerRunDriver /AFLplusplus/utils/aflpp_driver/aflpp_driver.c",
" #19 0x5c25c8 in main /AFLplusplus/utils/aflpp_driver/aflpp_driver.c:300:10",
" #20 0x7ffff7a37082 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24082) (BuildId: 1878e6b475720c7c51969e69ab2d276fae6d1dee)",
" #21 0x50237d in _start (/message_deserialize_afl+0x50237d)",
"",
"0x6150001e4108 is located 8 bytes to the right of 512-byte region [0x6150001e3f00,0x6150001e4100)",
"allocated by thread T0 here:",
" #0 0x5bfbfa in operator new(unsigned long) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/asan/asan_new_delete.cpp:95:3",
"",
"SUMMARY: AddressSanitizer: heap-buffer-overflow /pytorch/aten/src/ATen/core/ivalue.h:432:27 in c10::IValue::isTensor() const",
"Shadow bytes around the buggy address:",
" 0x0c2a800347d0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa",
" 0x0c2a800347e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00",
" 0x0c2a800347f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00",
" 0x0c2a80034800: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00",
" 0x0c2a80034810: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00",
"=>0x0c2a80034820: fa[fa]fa fa fa fa fa fa fa fa fa fa fa fa fa fa",
" 0x0c2a80034830: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa",
" 0x0c2a80034840: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa",
" 0x0c2a80034850: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa",
" 0x0c2a80034860: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa",
" 0x0c2a80034870: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa",
"Shadow byte legend (one shadow byte represents 8 application bytes):",
" Addressable: 00",
" Partially addressable: 01 02 03 04 05 06 07",
" Heap left redzone: fa",
" Freed heap region: fd",
" Stack left redzone: f1",
" Stack mid redzone: f2",
" Stack right redzone: f3",
" Stack after return: f5",
" Stack use after scope: f8",
" Global redzone: f9",
" Global init order: f6",
" Poisoned by user: f7",
" Container overflow: fc",
" Array cookie: ac",
" Intra object redzone: bb",
" ASan internal: fe",
" Left alloca redzone: ca",
" Right alloca redzone: cb",
"==60983==ABORTING"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105537
Approved by: https://github.com/albanD
Proposal of two float8 variants - e5m2 and e4m3 - based on https://arxiv.org/pdf/2209.05433.pdf
Hide all Float8 operator implementations behind `#if !defined(C10_MOBILE)` guard to keep Android build size almost unchanged
TODO:
- Refactor duplicated code
- Cleanup unbalanced pragma pop in dtype utils
- Add native implementation on the CUDA side
Co-authored-by: Nikita Shulga <nshulga@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104242
Approved by: https://github.com/albanD
This change adds the TensorPipe header files to `torch_package_data` if `USE_DISTRIBUTED` is set to `ON` in the CMake cache. The TensorPipe library and CMake config is already available in the Torch wheel, but the headers are not. This resolves issue where out-of-tree backends could not implement TensorPipe converters, because the definition of the `tensorpipe::Message` struct is defined in the TensorPipe headers.
Fixes #105224.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105521
Approved by: https://github.com/albanD
Summary: Return early if we can easily determine the operator qualified name is invalid before attempting to retrieve the schema. In particular "::" should always be present. Quick estimate shows that this is >50x faster (100 us -> 2 us).
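A minimal sketch of the early return (the lookup callable is a stand-in, not the real schema retrieval code):
```python
def get_schema_or_none(qualified_op_name: str, lookup_schema):
    # A valid qualified operator name always contains "::", e.g. "aten::add.Tensor",
    # so bail out before the comparatively expensive schema lookup for malformed names.
    if "::" not in qualified_op_name:
        return None
    return lookup_schema(qualified_op_name)
```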
Test Plan: CI
Differential Revision: D47562587
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105495
Approved by: https://github.com/aaronenyeshi
Summary:
If a model was exported for Vulkan backend without (automatic or manual) device transfers, then the export is incorrect, and the JNI need not correct that.
(If this assumption is incorrect, please give feedback.)
Undo the changes from
- D23763771: automatic device transfers in JNI
- D39519168: `"requires_backend_transfers"` logic in JNI
Test Plan: Verify CUNET+ hybrid model from D47488843 works.
Reviewed By: SS-JIA
Differential Revision: D47527244
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105583
Approved by: https://github.com/SS-JIA
**TL;DR**: if lowerings.py encounters aten.index_put, it will set V.graph.cudagraphs_okay = False, which will disable cudagraphs. index_put needs to be disabled because it crashes cuda graphs.
index_put_ fallbacks fail with cuda graphs when `accumulate=True` - likely for the same reason that it fails with deterministic_algorithms_enabled:
fcb7d4b358/aten/src/ATen/native/TensorAdvancedIndexing.cpp (L730)
A first attempt was just to expand the scenarios where `index_put_` is one of the disallowed kernels in utils.py: 2fa7d11b64/torch/_inductor/utils.py (L436-L438)
However this disables cuda graphs in too many scenarios, because index_put doesn't cause issues if it gets fused, it only causes issues if the aten kernel gets called. So in the updated version of this PR, we check for fallbacks in lowerings.py and disable cudagraphs only if a fallback is encountered there.
Example of failure outside of PT2:
```python
import torch

def fn(x, y, z):
    x = torch.zeros_like(x)
    return x.index_put_([y], z, True)
    # return x + 1

x = torch.zeros((512, 512), dtype=torch.bool, device='cuda')
y = torch.arange(512, dtype=torch.int64, device='cuda')
z = torch.ones((512, 512), dtype=torch.bool, device='cuda')

s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for i in range(3):
        fn(x, y, z)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    fn(x, y, z)
```
fails with
```
Traceback (most recent call last):
File "/data/users/dberard/scripts/graphed_index_put.py", line 24, in <module>
fn(x, y, z)
File "/data/users/dberard/scripts/graphed_index_put.py", line 8, in fn
return x.index_put_([y], z, True)
RuntimeError: CUDA error: operation not permitted when stream is capturing
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/data/users/dberard/scripts/graphed_index_put.py", line 24, in <module>
fn(x, y, z)
File "/data/users/dberard/pytorch/torch/cuda/graphs.py", line 173, in __exit__
self.cuda_graph.capture_end()
File "/data/users/dberard/pytorch/torch/cuda/graphs.py", line 79, in capture_end
super().capture_end()
RuntimeError: CUDA error: operation failed due to a previous error during capture
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
Differential Revision: [D47538548](https://our.internmc.facebook.com/intern/diff/D47538548)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105439
Approved by: https://github.com/eellison
**Summary**
When converting a float tensor to the uint8 data type via `tensor.to(dtype=torch.uint8)`, PyTorch directly truncates the decimal part. Previously, `convert_float_to_uint8` used `_mm512_cvtps_epi32`, which applies the default rounding mode (round to nearest) when converting float to uint8 and therefore doesn't align with the eager mode behavior. Change `_mm512_cvtps_epi32` to `_mm512_cvttps_epi32` to directly truncate when converting float tensors to uint8.
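A minimal sketch of the eager-mode behavior being matched:
```python
import torch

x = torch.tensor([0.4, 0.6, 1.5, 2.9])
# Eager mode truncates the decimal part when casting to uint8 ...
print(x.to(dtype=torch.uint8))  # tensor([0, 0, 1, 2], dtype=torch.uint8)
# ... whereas round-to-nearest would have produced [0, 1, 2, 3].
```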
**Test Plan**
```
python -m pytest test_cpu_repro.py -k test_to_uint8_rounding_method
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105109
Approved by: https://github.com/jgong5, https://github.com/mingfeima, https://github.com/jerryzh168
Summary:
Running any EgoOCR workflow in non-opt modes was breaking with https://fburl.com/strict-weak-ordering
Painstakingly found out that the stable_sort comparator in the generate_proposals caffe2 op was the issue, due to numerical imprecision. This was causing the Word Detector model to barf with the error. Adding explicit handling for the [irreflexivity property](https://www.boost.org/sgi/stl/StrictWeakOrdering.html) fixes this annoying strict-weak-ordering issue that has bugged me and several others (https://fb.workplace.com/groups/1405155842844877/permalink/7079705785389826/) for a while.
We can finally run all OCR workflows in non-opt mode! :)
Test Plan:
Debugged this with `fdb --disable-auto-breakpoints --secondary-debugger=lldb buck2 run mode/dev-sand ai_demos/server_model_zoo/models/ego_ocr_e2e_prod:ego_ocr_e2e_prod_binary`
and running `breakpoint set -E c++` in the lldb terminal.
Differential Revision: D47446816
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105189
Approved by: https://github.com/malfet, https://github.com/atalman
This PR adds initial dynamo support for DTensor, in particular, it:
- allows DTensor be passed into a compiled function, and allow fakify
DTensor during dynamo tracing by turning the inner local tensor to meta
tensor.
- We use `allow_in_graph` to include `DTensor` and `DTensor.from_local` to be represented as `TorchVariable`
- The dtensor created becomes a normal `TensorVariable` and it would insert any tensor operations to the output graph just like torch.Tensor
- note that DTensor has a new instance method `redistribute` compared to a plain tensor, and we currently special-case it in `TensorVariable`
`from_local` and `redistribute` both accept some non-trivial metadata as arguments (i.e. DeviceMesh, Placement) which fx.Graph does not support. In order to let these two APIs appear in the dynamo-captured graph, we encode the metadata into a new function (like `functools.partial`), and the new function only accepts prim args (i.e. tensors); then we put a `call_function` with this new function into the graph. This was suggested by @ezyang. The underlying rationale is that the metadata will not change across graph invocations, so it's safe to encode it.
Captured graph:
```
def forward(self, L_x_ : torch.Tensor):
    l_x_ = L_x_
    # File: /scratch/wanchaol/work/pytorch/test/distributed/_tensor/test_dtensor.py:685, code: dt = DTensor.from_local(x, mesh, [Shard(0)], run_check=False)
    prim_from_local = torch__dynamo_variables_torch_prim_from_local(l_x_, run_check = False); l_x_ = None
    # File: /scratch/wanchaol/work/pytorch/test/distributed/_tensor/test_dtensor.py:686, code: return dt.redistribute(mesh, [Replicate()]).to_local() + 2
    prim_redistribute = torch__dynamo_variables_tensor_prim_redistribute(prim_from_local); prim_from_local = None
    to_local = prim_redistribute.to_local(); prim_redistribute = None
    add = to_local + 2; to_local = None
    return (add,)
```
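For illustration only, a minimal sketch of that encoding trick (hypothetical helper and argument names; the real logic lives in dynamo's variable tracking):
```python
# Bake the non-traceable metadata (DeviceMesh, placements) into a new callable,
# so the FX graph only ever sees prim args such as tensors. `dtensor_cls`
# stands in for the real DTensor class.
def make_prim_from_local(dtensor_cls, mesh, placements):
    def prim_from_local(local_tensor, run_check=False):
        return dtensor_cls.from_local(local_tensor, mesh, placements, run_check=run_check)
    return prim_from_local
```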
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103146
Approved by: https://github.com/voznesenskym
Fixes: #105143
In summary, the changes are:
- Check if Z3 is installed when the module is loaded
- Naming consistently as "translation validation" (not "validator")
- Skipping tests if Z3 is not installed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105168
Approved by: https://github.com/ezyang
Summary:
This caused some internal tests to fail. I'm not sure it is possible to easily
revert the original diff. This diff is a hotfix that changes the autograd
fallback behavior to what it was previously.
Test Plan: Existing tests
Differential Revision: D47569822
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105505
Approved by: https://github.com/soulitzer
Summary:
* Create a private global-scope function _generate_subsequent because static class-attribute member functions are not supported by TorchScript, resulting in torchscripting errors.
* Make TransformerEncoder and TransformerDecoder consistent w.r.t. is_causal handling by calling _detect_causal_mask
* Clarify documentation that is_causal is a hint
* Move causal mask detection into a method _detect_causal_mask
* Only accept an input-size-compatible causal mask as a causal mask
* Update _generate_subsequent_causal_mask to include factory kwargs for dtype and device:
avoid extra copies & conversions by passing them directly to torch.full (see the sketch after this list).
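A minimal sketch of the factory-kwargs idea (the function name is taken from the list above; the body is illustrative, not the exact implementation):
```python
import torch

def _generate_subsequent(sz: int, device=None, dtype=None) -> torch.Tensor:
    # Build the causal mask directly with the requested dtype/device so no
    # extra copy or dtype conversion is needed afterwards.
    return torch.triu(
        torch.full((sz, sz), float("-inf"), device=device, dtype=dtype),
        diagonal=1,
    )
```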
Test Plan: sandcastle & github CICD
Continuation of #101487 (due to a tooling issue) which is a continuation-in-part of https://github.com/pytorch/pytorch/pull/98327 by @janEbert
Differential Revision: D47427117
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105265
Approved by: https://github.com/mikaylagawarecki
Summary: We want to do this little by little. For now, I tried it only on DissectedPartsModel, which needs to use the aot_export version.
Test Plan: CI
Reviewed By: zhxchen17
Differential Revision: D46785735
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104897
Approved by: https://github.com/JacobSzwejbka
Summary:
As suggested in #105230, mypy checking is enabled in `torch/_inductor/lowering.py`.
23 errors fixed; 6 silenced with `# type: ignore[attr-defined]`.
Test Plan:
Before the fix:
```
$ mypy torch/_inductor/lowering.py
torch/_inductor/lowering.py:139:16: error: "Symbol" has no attribute "is_integer" [attr-defined]
torch/_inductor/lowering.py:263:20: error: Incompatible types in assignment (expression has type "Union[List[Any], Tuple[Any, ...]]", variable has type "List[Any]") [assignment]
torch/_inductor/lowering.py:427:49: error: "IRNode" has no attribute "get_size" [attr-defined]
torch/_inductor/lowering.py:439:37: error: "IRNode" has no attribute "get_dtype" [attr-defined]
torch/_inductor/lowering.py:456:34: error: "IRNode" has no attribute "get_device" [attr-defined]
torch/_inductor/lowering.py:645:44: error: Need type annotation for "b" [var-annotated]
torch/_inductor/lowering.py:1321:12: error: "FakeTensor" has no attribute "is_cpu" [attr-defined]
torch/_inductor/lowering.py:1542:24: error: Argument 3 to "FixedLayout" has incompatible type "List[int]"; expected "List[Expr]" [arg-type]
torch/_inductor/lowering.py:1542:81: error: Argument "offset" to "FixedLayout" has incompatible type "int"; expected "Expr" [arg-type]
torch/_inductor/lowering.py:1571:24: error: Argument 3 to "FixedLayout" has incompatible type "List[int]"; expected "List[Expr]" [arg-type]
torch/_inductor/lowering.py:1571:81: error: Argument "offset" to "FixedLayout" has incompatible type "int"; expected "Expr" [arg-type]
torch/_inductor/lowering.py:1654:12: error: Incompatible types in assignment (expression has type "List[Any]", variable has type "Tuple[Any, ...]") [assignment]
torch/_inductor/lowering.py:2009:9: error: Need type annotation for "ranges" (hint: "ranges: List[<type>] = ...") [var-annotated]
torch/_inductor/lowering.py:2151:16: error: Incompatible types in assignment (expression has type "List[Any]", variable has type "Tuple[Any, ...]") [assignment]
torch/_inductor/lowering.py:2198:43: error: Item "type" of "Union[List[Any], type]" has no attribute "__iter__" (not iterable) [union-attr]
torch/_inductor/lowering.py:2229:36: error: Argument 1 to "len" has incompatible type "Union[List[Any], type]"; expected "Sized" [arg-type]
torch/_inductor/lowering.py:2231:38: error: Item "type" of "Union[List[Any], type]" has no attribute "__iter__" (not iterable) [union-attr]
torch/_inductor/lowering.py:2233:35: error: Item "type" of "Union[List[Any], type]" has no attribute "__iter__" (not iterable) [union-attr]
torch/_inductor/lowering.py:2569:54: error: Incompatible default for argument "reduce" (default has type "None", argument has type "str") [assignment]
torch/_inductor/lowering.py:2569:54: note: PEP 484 prohibits implicit Optional. Accordingly, mypy has changed its default to no_implicit_optional=True
torch/_inductor/lowering.py:2569:54: note: Use https://github.com/hauntsaninja/no_implicit_optional to automatically upgrade your codebase
torch/_inductor/lowering.py:2586:59: error: Incompatible default for argument "reduce" (default has type "None", argument has type "str") [assignment]
torch/_inductor/lowering.py:2586:59: note: PEP 484 prohibits implicit Optional. Accordingly, mypy has changed its default to no_implicit_optional=True
torch/_inductor/lowering.py:2586:59: note: Use https://github.com/hauntsaninja/no_implicit_optional to automatically upgrade your codebase
torch/_inductor/lowering.py:2720:65: error: Incompatible default for argument "scales_x" (default has type "None", argument has type "Tuple[float]") [assignment]
torch/_inductor/lowering.py:2720:65: note: PEP 484 prohibits implicit Optional. Accordingly, mypy has changed its default to no_implicit_optional=True
torch/_inductor/lowering.py:2720:65: note: Use https://github.com/hauntsaninja/no_implicit_optional to automatically upgrade your codebase
torch/_inductor/lowering.py:2735:5: error: Name "scale" already defined on line 2731 [no-redef]
torch/_inductor/lowering.py:2758:47: error: Argument 3 to "upsample_nearestnd" has incompatible type "Tuple[Optional[float]]"; expected "Tuple[float]" [arg-type]
torch/_inductor/lowering.py:2765:47: error: Argument 3 to "upsample_nearestnd" has incompatible type "Tuple[Optional[float], Optional[float]]"; expected "Tuple[float]" [arg-type]
torch/_inductor/lowering.py:2776:47: error: Argument 3 to "upsample_nearestnd" has incompatible type "Tuple[Optional[float], Optional[float], Optional[float]]"; expected "Tuple[float]" [arg-type]
torch/_inductor/lowering.py:2949:13: error: No binding for nonlocal "grad" found [misc]
torch/_inductor/lowering.py:3063:49: error: Argument 2 to "range_mask_low" has incompatible type "int"; expected "Expr" [arg-type]
torch/_inductor/lowering.py:3271:48: error: "IRNode" has no attribute "data" [attr-defined]
torch/_inductor/lowering.py:3272:16: error: "IRNode" has no attribute "data" [attr-defined]
Found 29 errors in 1 file (checked 1 source file)
```
After the fix:
```
$ mypy torch/_inductor/lowering.py
Success: no issues found in 1 source file
```
Reviewers: @eellison
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105317
Approved by: https://github.com/eellison
Summary:
Use templates to generate shaders for unary operators `exp` and `sqrt` for in-place and not in-place variants.
[sqrt](https://pytorch.org/docs/stable/generated/torch.sqrt.html)
[exp](https://pytorch.org/docs/stable/generated/torch.Tensor.exp.html#torch.Tensor.exp)
Refactor: use 'NAME' field in yaml for generated shader name in `gen_vulkan_spv.py`
Test Plan:
New tests:
```
lfq@lfq-mbp fbsource % buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 -- --gtest_filter="*unary_op*"
Parsing buck files: finished in 16.1 sec
Creating action graph: finished in 0.7 sec
Downloaded 75/3986 artifacts, 248.89 Mbytes, 96.3% cache miss (for updated rules)
Building: finished in 08:24.0 min (100%) 2571/2571 jobs, 2571/2571 updated
Total time: 08:40.9 min
BUILD SUCCEEDED
Running main() from xplat/third-party/gmock/googletest-1.12.1/googletest/src/gtest_main.cc
Note: Google Test filter = *unary_op*
[==========] Running 4 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 4 tests from VulkanAPITest
[ RUN ] VulkanAPITest.unary_op_exp
[ OK ] VulkanAPITest.unary_op_exp (479 ms)
[ RUN ] VulkanAPITest.unary_op_exp_
[ OK ] VulkanAPITest.unary_op_exp_ (1 ms)
[ RUN ] VulkanAPITest.unary_op_sqrt
[ OK ] VulkanAPITest.unary_op_sqrt (2 ms)
[ RUN ] VulkanAPITest.unary_op_sqrt_
[ OK ] VulkanAPITest.unary_op_sqrt_ (2 ms)
[----------] 4 tests from VulkanAPITest (485 ms total)
[----------] Global test environment tear-down
[==========] 4 tests from 1 test suite ran. (485 ms total)
[ PASSED ] 4 tests.
```
All tests:
https://www.internalfb.com/phabricator/paste/view/P786547213
Run clang-format on shader files and `UnaryOp.cpp`
Differential Revision: D47271856
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104994
Approved by: https://github.com/SS-JIA
This fixes https://github.com/pytorch/pytorch/issues/104504.
- When not using full-precision eval, the relevant fix is to force `_use_sharded_views()` calls if needed in `SUMMON_FULL_PARAMS` training state.
- When using full-precision in eval, the relevant fix is tracking what was the unsharded flat parameter from which the unsharded views were computed and using that instead of determining the unsharded flat parameter from the calling context via `_get_padded_unsharded_flat_param()`.
This also fixes https://github.com/pytorch/pytorch/issues/104770.
<details>
<summary> Print output showing parity </summary>
```
Key: 0
Model 1: [-1.5, 6.40625, -0.9453125, -0.3828125, 0.16015625, -1.5078125]
Model 2: [-1.5, 6.40625, -0.9453125, -0.3828125, 0.16015625, -1.5078125]
Key: 1
Model 1: [0.0157470703125, -0.8828125, 5.65625, 1.1328125, 0.275390625, 0.11181640625]
Model 2: [0.0157470703125, -0.8828125, 5.65625, 1.1328125, 0.275390625, 0.11181640625]
Key: 2
Model 1: [0.1689453125, -0.00567626953125, -0.09375, 7.34375, -0.18359375, -0.09521484375]
Model 2: [0.1689453125, -0.00567626953125, -0.09375, 7.34375, -0.18359375, -0.09521484375]
Key: 3
Model 1: [0.546875, -0.8984375, 0.228515625, 0.7578125, 6.0625, 0.435546875]
Model 2: [0.546875, -0.8984375, 0.228515625, 0.7578125, 6.0625, 0.435546875]
Key: 4
Model 1: [-0.66796875, -0.88671875, 0.30078125, 0.06494140625, 0.412109375, 6.9375]
Model 2: [-0.66796875, -0.88671875, 0.30078125, 0.06494140625, 0.412109375, 6.9375]
Key: 5
Model 1: [0.07763671875, 0.8671875, -0.43359375, 0.5703125, 0.76171875, -0.0089111328125]
Model 2: [0.07763671875, 0.8671875, -0.43359375, 0.5703125, 0.76171875, -0.0089111328125]
Key: 6
Model 1: [-0.283203125, -0.361328125, 0.474609375, 0.10205078125, 1.125, -0.0859375]
Model 2: [-0.283203125, -0.361328125, 0.474609375, 0.10205078125, 1.125, -0.0859375]
Key: 7
Model 1: [1.140625, 0.62890625, -0.07568359375, -1.0390625, -0.2578125, -0.053955078125]
Model 2: [1.140625, 0.62890625, -0.07568359375, -1.0390625, -0.2578125, -0.053955078125]
Key: 8
Model 1: [0.68359375, -1.09375, 0.59375, 1.0, -0.23828125, 0.578125]
Model 2: [0.68359375, -1.09375, 0.59375, 1.0, -0.23828125, 0.578125]
Key: 9
Model 1: [0.515625, 0.296875, -0.1826171875, -0.12890625, -0.51953125, -0.3359375]
Model 2: [0.515625, 0.296875, -0.1826171875, -0.12890625, -0.51953125, -0.3359375]
```
</details>
Follow-ups:
- I suspect that for `SHARD_GRAD_OP`, train forward -> eval forward when using full-precision in eval will not free the low-precision unsharded parameters from the train forward, resulting in 1.5x unsharded parameter memory.
Differential Revision: [D47527597](https://our.internmc.facebook.com/intern/diff/D47527597)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105346
Approved by: https://github.com/fegin, https://github.com/rohan-varma
The test should respect self.device_type as it checks whether the environment
has enough GPUs to serve the requested world size.
The test will lead to hangs if we try to run 8 ranks over our 2-4 GPU CI instances.
Fixes #104769
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105357
Approved by: https://github.com/wanchaol
Summary:
Use templates to generate the kernels for add, sub, mul, div and their variants (tensor/scalar, in-place/not in-place).
Rename Arithmetic.cpp to BinaryOp.cpp
Test Plan:
https://www.internalfb.com/phabricator/paste/view/P785131030
```
buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1
...
xplat/caffe2/aten/src/ATen/test/vulkan_api_test.cpp:6377: Skipped
QueryPool is not available
[ SKIPPED ] VulkanAPITest.querypool_flushed_shader_log (0 ms)
[----------] 307 tests from VulkanAPITest (5427 ms total)
[----------] Global test environment tear-down
[==========] 307 tests from 1 test suite ran. (5427 ms total)
[ PASSED ] 306 tests.
[ SKIPPED ] 1 test, listed below:
[ SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
YOU HAVE 5 DISABLED TESTS
```
Differential Revision: D47307169
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105380
Approved by: https://github.com/SS-JIA
Calling `isinstance(x, Tuple[Node, Node])` would either fail, or raise a
type error on a more modern Python, as none of the tuples are actually
instances of `Tuple`
```python
>>> from typing import Tuple
>>> from torch.fx import Node
>>> edge_or_node=(Node(None, "foo", "output", "foo", None, None), Node(None, "bar", "output", "bar", None, None))
>>> isinstance(edge_or_node, tuple) and len(edge_or_node) == 2 and all(isinstance(x, Node) for x in edge_or_node)
True
>>> isinstance(edge_or_node, Tuple[Node, Node])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/malfet/miniconda3/lib/python3.10/typing.py", line 994, in __instancecheck__
return self.__subclasscheck__(type(obj))
File "/Users/malfet/miniconda3/lib/python3.10/typing.py", line 997, in __subclasscheck__
raise TypeError("Subscripted generics cannot be used with"
TypeError: Subscripted generics cannot be used with class and instance checks
```
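A small helper with a hypothetical name that captures the working runtime check shown above:
```python
from torch.fx import Node

def is_node_pair(obj) -> bool:
    # Runtime-checkable equivalent of the unusable `isinstance(x, Tuple[Node, Node])`.
    return (
        isinstance(obj, tuple)
        and len(obj) == 2
        and all(isinstance(x, Node) for x in obj)
    )
```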
### <samp>🤖 Generated by Copilot at 40fa451</samp>
> _Fix type annotation_
> _Quantize nodes in the graph_
> _Autumn leaves falling_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105476
Approved by: https://github.com/jerryzh168
Summary: `sort_keys=True` for autotuning results fails because we can't compare ExternKernelCaller objects. Besides, it isn't really necessary to sort the keys, either for the autotuning results or for the sysinfo. Let's just drop sorting altogether.
Test Plan: sandcastle + CI
Reviewed By: aaronenyeshi
Differential Revision: D47544587
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105469
Approved by: https://github.com/jansel
This PR adds the necessary plumbing through torchdynamo to allow tensor subclasses with a certain contract (i.e. with `__tensor_flatten__` and `__tensor_unflatten__`) to go through the dynamo fakification pass by fakifying the tensor subclass's internal components.
Some of the tensor subclass contract logic is mostly borrowed from
https://github.com/pytorch/pytorch/pull/97540
Added some tests to verify that simply passing a tensor subclass (i.e. DTensor) through dynamo eager works as expected.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105308
Approved by: https://github.com/ezyang
This PR canonicalizes the detach call site so that only `distribute_tensor` calls detach. Other call sites are changed to view_as, and the detach call in the tensor constructor is removed.
This is so that we don't detach the local tensor on every op run when rewrapping the DTensor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105239
Approved by: https://github.com/albanD
Fixes #104985
Implemented a `set_multithreading_enabled` C++ function to directly alter the state rather than using the `MultithreadingEnabled` class, which was automatically resetting the state when the object was destroyed. This behavior more closely aligns with set_grad_enabled, which works as expected. This allows us to change the Python class `set_multithreading_enabled` to act as both a function and a context manager.
I also added a getter: `torch._C.is_multithreading_enabled`
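A usage sketch, assuming the class is exposed as `torch.autograd.set_multithreading_enabled` (mirroring `set_grad_enabled`):
```python
import torch

# Acts as a plain function: flips the state without restoring it on exit.
torch.autograd.set_multithreading_enabled(False)

# Acts as a context manager: restores the previous state when the block exits.
with torch.autograd.set_multithreading_enabled(True):
    pass
```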
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105291
Approved by: https://github.com/albanD
Summary:
Original PR at https://github.com/pytorch/pytorch/pull/104977. Landing from fbcode instead.
Add an aot_inductor backend (Export+AOTInductor) in the benchmarking harness. Note it is not a dynamo backend.
Moved files from torch/_inductor/aot_inductor_include to torch/csrc/inductor as a more standard way for exposing headers
Created a caching function in benchmarks/dynamo/common.py for compiling, loading and caching the .so file, as a proxy for a pure C++ deployment, but easier for benchmarking.
Differential Revision: D47452591
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105221
Approved by: https://github.com/jansel
Summary:
Until we can further investigate the autotuning differences between MAST and non-MAST (devserver) environments, turn off the global cache for all non-MAST environments. This ensures we don't see unexpected regressions.
Also update scuba logging for cache lookup, and add scuba logging for autotuning results.
Test Plan: sandcastle + CI
Differential Revision: D47516633
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105375
Approved by: https://github.com/jansel
issues resolved: https://github.com/pytorch/pytorch/issues/101832
**context**: get the torch.compile config for further usage. E.g., the training platform wants to know whether a model is compiled with cudagraphs enabled and trigger further action.
**how it is implemented**
* the core logic is backend.get_compiler_config() in torch/_dynamo/eval_frame.py
* for backend='inductor' / _TorchCompileInductorWrapper, we have an inductor-specific implementation of get_compiler_config in torch/_inductor/compile_fx.py and torch/__init__.py
**how to use it**: Below is an example.
```
model = DummyModule()
optimized_module = torch.compile(
    model, options={"triton.cudagraphs": True}
)
compiler_config = optimized_module.get_compiler_config()
if compiler_config["triton.cudagraphs"]:
    pass
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105026
Approved by: https://github.com/yanboliang, https://github.com/jansel
torch.profiler.record_function and torch.profiler.profile are ignored by dynamo. In the common case, users have `record_function` in the middle of their program in order to annotate a section of the profile.
The previous error message was `Profiler will be ignored`. Users would think that profiling would be completely ignored.
Now the message will look like `Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105362
Approved by: https://github.com/yanboliang, https://github.com/aaronenyeshi
Summary:
The draft version of a group + batch fusion framework, and the group linear fusion implementation.
In the future, it's pretty straightforward to add a new group/batch fusion policy by defining a class with match + fuse functions.
Test Plan: buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:group_batch_fusion
Differential Revision: D46956695
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105116
Approved by: https://github.com/jansel
Summary:
Currently, broadcast is supported for 4D tensors where, if the batch or channel dimensions are not equal, then the batch and channel of one tensor must both be 1, i.e.:
```
tensorA NCHW:
5, 2, 3, 3
tensorB NCHW:
1, 1, 3, 3 --> batch=1, channel=1
```
This diff adds broadcast support for 4D tensors where the batch and channel of a tensor are different, i.e.:
```
tensorA NCHW:
5, 1, 3, 3
tensorB NCHW:
1, 5, 3, 3
```
Broadcast rules:
```
- tensorA.dim()[x] == tensorB.dim()[x]
- tensorA.dim()[x] == 1 || tensorB.dim()[x] == 1
- tensorA.dim()[x] does not exist || tensorB.dim()[x] does not exist
```
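For reference, the equivalent dense PyTorch broadcast that this diff also enables on the Vulkan backend:
```python
import torch

# Batch and channel differ, but each mismatched dim is 1 on one side,
# so standard broadcasting applies and the result is 5x5x3x3.
a = torch.randn(5, 1, 3, 3)
b = torch.randn(1, 5, 3, 3)
print((a + b).shape)  # torch.Size([5, 5, 3, 3])
```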
Broadcast method:
1. Pass `output`, `input` and `other` tensors to the shader
2. Iterate through the output texture to calculate the value of each texel (no repeating)
3. Mapping NHW positions: use modulo
4. Mapping C position: divide pos.z by ceil(C/4) to map to original tensor range
---
Also some test refactoring to reduce repeated setup code.
Test Plan:
New tests:
Add
```
[ RUN ] VulkanAPITest.add_broadcast5
[ OK ] VulkanAPITest.add_broadcast5 (0 ms)
[ RUN ] VulkanAPITest.add_broadcast6
[ OK ] VulkanAPITest.add_broadcast6 (0 ms)
```
Sub
```
[ RUN ] VulkanAPITest.sub_broadcast5
[ OK ] VulkanAPITest.sub_broadcast5 (0 ms)
[ RUN ] VulkanAPITest.sub_broadcast6
[ OK ] VulkanAPITest.sub_broadcast6 (0 ms)
```
Mul
```
[ RUN ] VulkanAPITest.mul_broadcast5
[ OK ] VulkanAPITest.mul_broadcast5 (1 ms)
[ RUN ] VulkanAPITest.mul_broadcast6
[ OK ] VulkanAPITest.mul_broadcast6 (1 ms)
```
Div
```
[ RUN ] VulkanAPITest.div_broadcast5
[ OK ] VulkanAPITest.div_broadcast5 (1 ms)
[ RUN ] VulkanAPITest.div_broadcast6
[ OK ] VulkanAPITest.div_broadcast6 (2 ms)
```
All tests:
https://www.internalfb.com/phabricator/paste/view/P781794761
Run clang-format on glsl files and Arithmetic.cpp
Differential Revision: D46874508
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104718
Approved by: https://github.com/SS-JIA
Summary: move gemm autotuning local cache to `cache_dir()/cache/{hash}` since we might have multiple local caches, i.e. one cache with `allow_tf32=True` and one cache with `allow_tf32=False`
Test Plan: sandcastle + CI
Differential Revision: D47504654
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105334
Approved by: https://github.com/jansel
dependencies.py is used for tracking reads and writes, which is used for identifying dependencies between buffers: i.e. if buffer X reads buffer Y, then X depends on Y. ops.bucketize() reads from an offsets tensor, so we should track it in dependencies.py to correctly track dependencies. Since bucketize performs a binary search over the offsets tensor, the dependency is marked as a StarDep to indicate that the entire tensor is needed.
Use case: we find that jagged tensor dense_to_jagged ops - which use bucketize() to map jagged indices to dense indices - perform better if the bucketize() kernel is separated from the gather kernel. Previously, because bucketize() wasn't marked as reading anything, it would just get inlined.
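For reference, the eager-mode counterpart of the op in question; each lookup binary-searches the full boundaries tensor, which is why the dependency is a StarDep:
```python
import torch

boundaries = torch.tensor([1, 3, 5, 7, 9])
values = torch.tensor([2, 6, 8])
# Each lookup binary-searches over the whole boundaries tensor.
print(torch.bucketize(values, boundaries))  # tensor([1, 3, 4])
```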
Differential Revision: [D47422704](https://our.internmc.facebook.com/intern/diff/D47422704)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105102
Approved by: https://github.com/eellison
Add similar semantics for creating a buffer object similar to creating a parameter. This is done by introducing a new `Buffer` class that can be used for type disambiguation. The underlying functionality of registering a buffer remains the same as the `register_buffer` method has not been changed. The `persistent` parameter in the `Buffer` type is to indicate whether a buffer object should be persistent or not. Other non-test changes have to do with getting the new `Buffer` type recognized by inductor and dynamo. Remaining changes are test changes to make sure that the `Buffer` type can be used as a drop in replacement for `register_buffer` as it just leads to `register_buffer` being called. The addition of this new functionality still allows for normal tensors to be used as buffers so these changes are intended to be backwards compatible.
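An illustrative sketch of the new semantics, assuming the class is exposed as `torch.nn.Buffer`:
```python
import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        # Assigning a Buffer behaves like assigning a Parameter; under the
        # hood it still goes through register_buffer.
        self.running_mean = nn.Buffer(torch.zeros(10), persistent=True)

m = Model()
print([name for name, _ in m.named_buffers()])  # ['running_mean']
```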
Fixes #35735
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104069
Approved by: https://github.com/mikaylagawarecki
Summary:
Rather than processing the events into a time-and-sizes plot, dump the actual events as (timestamp, action, num of bytes, category) when the output file name ends in `raw.json.gz`.
This can allow downstream analysis tools to process these events. It also avoids having to control the granularity of the previous json.gz in memory profiler.
Test Plan: CI Tests
Differential Revision: D47416544
Pulled By: aaronenyeshi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105094
Approved by: https://github.com/davidberard98
Summary: Exercise subclass of TransformerEncoderLayer
Additional unit tests for change in #102045 to show correct e2e operation (cf. issue #100188)
Also: remove batch_first from the list of TS module constants where it is not used, to resolve a torchscripting warning
Test Plan: sandcastle, github
Differential Revision: D47503004
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105297
Approved by: https://github.com/davidberard98
Previously, we made backwards graph compilation lazy to avoid paying
for compilation if the user didn't actually end up using the backwards
graph. This was useful in the old days when a lot of things in Inductor
didn't work and we could bypass errors this way.
However, this has a bad implication for dynamic shapes: the backwards
graph compilation can trigger extra guards, which are too late to
install in the Dynamo context if we wait until backwards is being run.
So in this PR I move us back to compiling backwards graph immediately
if we capture any SymInts for backwards.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104971
Approved by: https://github.com/Chillee
This is the first PR towards simplifying sympy_interp, and more
generally, simplifying the implementation of ValueRangeAnalysis for
SymPy expressions.
In general, it would be conceptually good to have a minimal subset of
operations that make up our SymPy expressions, be they guards or
indexing expressions. This would allow us to reason better about SymPy
guards and potentially have invariants like knowing that guards are
continuous piecewise rational functions. If this were the case,
we could operate on them using exact arithmetic and completely avoid
precision errors like the one found in https://github.com/pytorch/pytorch/issues/105097
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105138
Approved by: https://github.com/ezyang
When running BertForMaskedLM, I found that if I enable the kernel benchmark, essentially identical kernels will be defined once per call site. The reason is that the benchmark harness of those kernels uses a different seed_offset for each invocation. We should be safe to just force seed_offset to be 0 so we can deduplicate identical kernel definitions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105099
Approved by: https://github.com/jansel
The decomposition for unfold uses `as_strided`, which forces the input to be realized. Instead, this implements it as a `GenericView` with reindexing, which removes the need to realize, though it does call `mark_reuse` in case the input computation is expensive and the windows overlap.
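For reference, what `unfold` computes; the overlapping windows are why `mark_reuse` is needed:
```python
import torch

x = torch.arange(10.)
# size=4, step=2 -> overlapping windows, produced as a view (no copy).
print(x.unfold(0, 4, 2))
# tensor([[0., 1., 2., 3.],
#         [2., 3., 4., 5.],
#         [4., 5., 6., 7.],
#         [6., 7., 8., 9.]])
```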
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105165
Approved by: https://github.com/lezcano, https://github.com/jansel
This PR re-lands
- [Typing] Fix PEP 484 Violation (#105022)
- Update mypy to 1.4.1 (#91983)
That were reverted due to the conflict with internal source repo.
Mostly fixes for PEP-484 violation (i.e. when default arg is set to None, but type is not annotated as optional)
Plus few real fixes:
- Add missing `_get_upgraders_entry_map` to `torch/_C/__init__.pyi`
- Add missing return statement to `torch._export.deserialize_graph`
- Fix error message in `torch.ao.ns.fx.weight_utils.get_lstm_mod_weights`
- Add assert in `torch/optim/optimizer.py` that the Optional list is not None
TODO (in followup PR):
- Fix erroneous `isinstance` check in `torch/ao/quantization/_pt2e/qat_utils.py`
Unrelated, to bypass CI failures due to the gcc9 dependency update in Ubuntu-18.04:
- Add hack to squash older libstdc++ from conda environment in favor one from OS to `.ci/docker/install_conda.sh`
- Update bazel cuda builds to focal, as with libstdc++-6.0.32 bazel builds loose the ability to catch exceptions (probably because they link with cupti statically, but I could not found where it is done)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105227
Approved by: https://github.com/atalman, https://github.com/albanD, https://github.com/Skylion007
Previously, x.size(0) could return a SymInt, even when the internal
sympy expression was actually already constant (e.g., due to an
introduced guard.) We now allow querying the Python object with
maybe_as_int, which lets us transmute these objects back to
int when possible.
It is still possible to end up with a constant SymInt even after this
change, e.g., if you get out a SymInt and while holding onto it
specialize it, but casual users are more likely to get ints when they
want to.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104828
Approved by: https://github.com/Skylion007
The dispatcher didn't check attribute dtypes, as AttributeProto is a totally different system from InputProto in ONNX. This PR introduces a mapping table from AttributeProto types to Python types and further utilizes it in opschema matching.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105104
Approved by: https://github.com/thiagocrepaldi
Fix cpp wrapper failure on TorchBench model `hf_Reformer` with `randn`:
```
random_rotations = torch.randn(rotations_shape, device=vectors.device, dtype=vectors.dtype)
```
For the cpp wrapper, when `kwargs` is not empty, for an `OpOverloadPacket` kernel we need to know the exact overload schema to handle the `kwargs` properly when calling the cpp kernel: this includes finding the correct order of the kwargs and getting the default value for optional args for which no value is provided (`layout` in the above case).
The current support in this PR is conservative and we'll extend the functionality in subsequent PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104575
Approved by: https://github.com/jgong5, https://github.com/desertfire
Original PR: #103546
Trying to support numpy function call in dynamo, with numpy dtype as argument.
For example:
```
def fn(x: int):
return np.empty_like(x, dtype=np.float64)
```
This currently doesn't work because `NumpyVariable` doesn't implement `as_proxy()`. The idea in `as_proxy()` for now is to convert `np.float64` and other np.<dtype> into `str` and then feed that into the corresponding `torch_np` method. The assumption here is that all `torch_np` methods that take a `dtype` kwarg can also take a `str` as `dtype`. This assumption holds for `numpy`.
For the previous example, we convert `np.float64` to `"float64"` in `as_proxy()` and then feed it into the `torch_np.empty_like()` method.
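A minimal sketch of that dtype-to-string conversion (hypothetical helper name, not the actual dynamo code):
```python
import numpy as np

def numpy_dtype_to_str(dtype) -> str:
    # np.float64 -> "float64"; the string form is accepted wherever a numpy
    # dtype kwarg is accepted.
    return np.dtype(dtype).name

print(numpy_dtype_to_str(np.float64))  # float64
```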
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105034
Approved by: https://github.com/voznesenskym
This PR re-lands
- [Typing] Fix PEP 484 Violation (#105022)
- Update mypy to 1.4.1 (#91983)
That were reverted due to the conflict with internal source repo.
Mostly fixes for PEP-484 violation (i.e. when default arg is set to None, but type is not annotated as optional)
Plus few real fixes:
- Add missing `_get_upgraders_entry_map` to `torch/_C/__init__.pyi`
- Add missing return statement to `torch._export.deserialize_graph`
- Fix error message in `torch.ao.ns.fx.weight_utils.get_lstm_mod_weights`
- Add assert in `torch/optim/optimizer.py` that the Optional list is not None
TODO (in followup PR):
- Fix erroneous `isinstance` check in `torch/ao/quantization/_pt2e/qat_utils.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105227
Approved by: https://github.com/atalman, https://github.com/albanD, https://github.com/Skylion007
This PR updates the documentation for `TripletMarginLoss` in `torch.nn`. The previous version of the documentation didn't mention the parameter `eps` used for numerical stability.
This PR does the following:
1. Describes the purpose and use of the `eps` parameter in the `TripletMarginLoss` class documentation.
2. Includes `eps` in the example usage of `TripletMarginLoss`.
Please review this update for completeness with respect to the `TripletMarginLoss` functionality. If there are any issues or further changes needed, please let me know.
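For reference, a usage sketch that spells out the `eps` argument:
```python
import torch
import torch.nn as nn

# eps is a small constant used for numerical stability of the distance computation.
triplet_loss = nn.TripletMarginLoss(margin=1.0, p=2, eps=1e-6)
anchor = torch.randn(8, 128, requires_grad=True)
positive = torch.randn(8, 128, requires_grad=True)
negative = torch.randn(8, 128, requires_grad=True)
loss = triplet_loss(anchor, positive, negative)
loss.backward()
```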
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105115
Approved by: https://github.com/mikaylagawarecki
Summary: Submodules may have a None call-spec value, which is OK. Updating the types + serializer to handle this.
Test Plan: CI
Reviewed By: ydwu4, zhxchen17
Differential Revision: D47353101
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105179
Approved by: https://github.com/zhxchen17
On s390x, a static cast may return a big positive number, in which case the uninitialized value of 'r' is returned. In case of +/-inf or +/-nan, use -1 explicitly.
Also initialize 'r' to 0 in case 'n+n' overflows anyway.
This change fixes
test_vmap_exhaustive_special_hermite_polynomial_h_cpu_float32 from test/functorch/test_vmap.py on s390x.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104705
Approved by: https://github.com/ezyang
The general idea is to do a separate CUDA graph for each size. Because of cuda graph trees, these graphs will all share the same memory pool, so your memory usage will only be the worst case memory usage of the biggest dynamic size you want. This requires an extra dispatch in the cudagraphified callable. You must pay for a CUDA graph recording for every dynamic size you encounter, but this is MUCH cheaper than running the entire PT2 compile stack, so I expect you to still see benefits.
This was surprisingly easy to do.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105064
Approved by: https://github.com/voznesenskym
Summary:
Context
-------
This PR adds a new fallback to the Autograd dispatch keys.
If you would prefer the old behavior:
- A quick (unsupported) way to get the previous behavior is to call
`torch._C._set_autograd_fallback("nothing")`
- Register "torch::CppFunction::makeFallthrough()" to your Autograd key,
like in https://gist.github.com/zou3519/d09a5f4b1afe2430af09fea67c6ff2c8
It is possible that this PR regresses performance of overhead-bound
models. If this is the case, please reach out (and apply one of the
temporary fixes in the previous section).
Description for reviewers
-------------------------
In order to deprecate registering autograd kernels at not an autograd
key, we add a fallback to the Autograd dispatch keys. This fallback
raises a warning if the user attempts to backprop through the operator
and is also configurable to either warn or not warn.
The goal of this PR is to
- preserve as much BC as possible
- raise a warning that whatever the user is doing is potentially wrong.
- be as performant as possible
There are roughly two cases:
- if the post-autograd kernels return a Tensor that requires grad, then
we install an autograd hook that raises a warning. We are preserving BC
in that it is possible that the user has a torch::autograd::Function
registered to their CPU key.
- if the post-autograd kernels return Tensors that do not require grad,
then we make them require_grad and install a WarnNotImplemented grad fn
that warns in the backward pass. This is mildly BC-breaking (see next
section).
Test Plan:
- bunch of new tests
BC-Breaking Note
----------------
This PR adds a new fallback to the Autograd dispatch keys. It affects
custom operators that do not have a kernel registered to the Autograd
keys (e.g. AutogradCPU and AutogradCUDA).
If the previous behavior was that the custom operator would return
Tensors that do not require grad if the inputs do require grad, then
this PR changes it so that all floating-point and complex returns do
require grad. See the "Context" section above for how to get the old
behavior.
Differential Revision: D47408353
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105078
Approved by: https://github.com/soulitzer
The story here is relatively simple - when we go to wrap a tensor, we (1) ensure that it is a real, not fake tensor (2) check if we have seen it before. (3) If we have seen it, we create a positive alias guard and return the associated variable. If not, we proceed.
By short-circuiting here, we avoid lifting it to a graph input, and guarantee that the names assigned to tensors are unique. This allows us to guard on the unique relationships (PyObject addresses, a.k.a. IDs, cannot match) to give us guards for negative aliases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104921
Approved by: https://github.com/jansel, https://github.com/ezyang
Simplifies the logic to not depend on info within the raised exception. Due to changes
in the onnx dispatcher, the diagnostic within the raised exception is now different, which broke
this pass when retrieving the unsupported fx node kind. Adds a proper unittest.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105156
Approved by: https://github.com/thiagocrepaldi
Previous to this PR, we were comparing torch args/kwargs with the OnnxFunction OpSchema without normalizing the args/kwargs first. Essentially, the function signature is different between ATen and OnnxFunction, and onnx-script preprocesses these args/kwargs with an internal tool, `param_manipulation`, for both eager mode and graph mode. This PR uses that internal tool to normalize the torch args/kwargs before feeding them to OnnxFunction during op_level_debug and dispatching. The PR significantly reduces the dispatcher's need for the nearest-matching mechanism.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104679
Approved by: https://github.com/BowenBao
The goal is to fix the problem from https://github.com/pytorch/pytorch/pull/102858
The full error this used to raise was:
```
2023-06-27T15:12:15.0663239Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/adamw.py", line 409, in _single_tensor_adamw
2023-06-27T15:12:15.0663699Z bias_correction1 = 1 - beta1 ** step
2023-06-27T15:12:15.0664200Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_tensor.py", line 40, in wrapped
2023-06-27T15:12:15.0664547Z return f(*args, **kwargs)
2023-06-27T15:12:15.0665031Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_tensor.py", line 882, in __rpow__
2023-06-27T15:12:15.0665483Z return torch.tensor(other, dtype=dtype, device=self.device) ** self
2023-06-27T15:12:15.0665899Z RuntimeError: CUDA error: operation not permitted when stream is capturing
2023-06-27T15:12:15.0666401Z CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
```
This pow issue was fixed in https://github.com/pytorch/pytorch/pull/104264 and so this problem should be solvable now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104254
Approved by: https://github.com/janeyx99, https://github.com/aws-murandoo
This PR implements a (yet private) frontend for scaled_dot_product_attention that works with BSR `attn_mask`.
This function is directly comparable (with suitable masks) to `torch.nn.functional.scaled_dot_product_attention` once `attn_mask.dtype == torch.bool`, but its behavior is different when `attn_mask.dtype != torch.bool`. This is because `torch.nn.functional.scaled_dot_product_attention` assumes that irrelevant values are supposed to be filled with `-inf`, while the selected ones should be `0`.
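For reference, the directly comparable dense call with a boolean mask (True positions attend, False positions are masked out):
```python
import torch
import torch.nn.functional as F

q = torch.randn(1, 4, 8, 16)
k = torch.randn(1, 4, 8, 16)
v = torch.randn(1, 4, 8, 16)
# Boolean causal mask: True means "attend".
attn_mask = torch.ones(8, 8, dtype=torch.bool).tril()
out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
```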
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104042
Approved by: https://github.com/amjames, https://github.com/cpuhrsch
constraints:
1. No support for gradient accumulation
2. CPU offload runs step() on CPU. In future PRs ideally we'd run this on GPU.
3. When CPU offload + optimizer overlap, we have to copy the flat_param grad to CPU with non_blocking=False, otherwise step() might run on invalid data.
4. Step is waited on in the post-backward final callback, when in theory it could wait until the next forward.
Differential Revision: [D44809582](https://our.internmc.facebook.com/intern/diff/D44809582/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98667
Approved by: https://github.com/awgu, https://github.com/fegin
This PR is only relevant for the Fake tensor Mode ONNX export. For the conventional export, everything is unchanged.
* An optional `rename_initializer=False` argument is added to an internal function `torch/onnx/_internal/fx/serialization.py::save_model_with_external_data` which is used by the public API `ExportOutput.save`.
* The default behavior (`rename_initializer=False`) is meant to be used by the public API `torch.onnx.dynamo_export` with the default Dynamo-based FX tracer (`DynamoExport`). In this scenario, both the ONNX graph inputs and the initializers have matching names with `.` in them (e.g. `linear.weight`).
* `rename_initializer=True` is meant to be used by `torch.onnx.dynamo_export` with a non-publicly-supported FX tracer called `FXSymbolicTracer`. This tracer lifts the FX graph initializers as inputs before the FX->ONNX conversion starts, and because of this, the initializer names must be valid Python identifiers (meaning `.` is not supported in argument names and must be replaced by `_` or similar). This causes the graph inputs to have names with `_` (e.g. `linear_weight`) while the initializers have `.` (e.g. `linear.weight`) in their names. This flag resolves the mismatch by replacing `.` with `_` when saving the ONNX proto (`save_model_with_external_data`).
* This PR also adds unit tests for numerical validation against pytorch eager for onnx export using dynamo-based fx tracer and fake mode enabled. (There are already tests for export with fx symbolic tracer with fake mode)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105002
Approved by: https://github.com/BowenBao
The main complexity comes from the __init__ function of Dataclass variables, which looks something like this:
```
[2023-07-10 05:01:29,548] torch._dynamo.symbolic_convert: [DEBUG] INLINING <code object __init__ at 0x7f7015154450, file "<string>", line 2>
3 0 LOAD_FAST 1 (b)
2 LOAD_FAST 0 (self)
4 STORE_ATTR 0 (b)
4 6 LOAD_FAST 2 (named_tensors)
8 LOAD_DEREF 0 (_HAS_DEFAULT_FACTORY)
10 IS_OP 0
12 POP_JUMP_IF_FALSE 20
14 LOAD_DEREF 1 (_dflt_named_tensors)
16 CALL_FUNCTION 0
18 JUMP_FORWARD 2 (to 22)
>> 20 LOAD_FAST 2 (named_tensors)
>> 22 LOAD_FAST 0 (self)
24 STORE_ATTR 1 (named_tensors)
26 LOAD_CONST 0 (None)
28 RETURN_VALUE
```
There are multiple issues
* VariableBuilder call in functions.py was wrong. We were calling *options as args.
* We were not setting source while tracking the new object. This led to no source for Dataclass variable, which has some new variables in its closures as seen in the above bytecode.
* There is IS_OP in above bytecode, which brings more cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104840
Approved by: https://github.com/jansel
This PR is to fix https://github.com/pytorch/pytorch/issues/101935.
Only when the input, parameters and hidden states are all on the CPU device will LSTM go into the oneDNN fast-path implementation. Otherwise, it falls back to the original implementation.
Note that if the input and parameters are indeed not on the same device, it will hit the error `Input and parameter tensors are not at the same device, found input tensor......` in `check_attributes`. Therefore, the proper usage of LSTM is `input.to(device)` and `model.to(device)` together.
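For reference, a minimal sketch of the proper usage pattern:
```python
import torch
import torch.nn as nn

device = "cpu"  # input, parameters and hidden states all on the same device
lstm = nn.LSTM(input_size=10, hidden_size=20, num_layers=2).to(device)
x = torch.randn(5, 3, 10, device=device)
output, (h_n, c_n) = lstm(x)
```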
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102050
Approved by: https://github.com/XiaobingSuper, https://github.com/albanD
Purely out of preference, this PR renames the streams to `_unshard_stream` instead of `_streams_unshard` etc. since the former reads more naturally. The PR also removes some duplicated comments and adds back a unit test that streams are shared.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104966
Approved by: https://github.com/rohan-varma
When creating DeviceMesh, _init_process_group() would validate that all calling ranks pass in the same `mesh` argument. In FSDP, we are currently creating the DeviceMesh based on the pg of the root state so the mesh will always be valid. Adding the flag to DeviceMesh, so we can skip the all_gather_tensor of the validation during construction time.
_validate_mesh defaults to True, but we manually flip it to False when initializing the device mesh in FSDP's _runtime_utils.py.
Will modify skipping pg creation if one already exists for both the 1D and 2D cases, and then delete the _init_process_groups flag in a follow-up PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104807
Approved by: https://github.com/wanchaol
This allows `ops.minimum` and `ops.maximum` to be hoisted for indirect indexing
into direct indexing expressions. I also add support to the cpp printer for
Min/Max and fix the triton printer to support multi-argument Min/Max.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105020
Approved by: https://github.com/lezcano
Summary: This rounds out the support for the [softmax function](https://pytorch.org/docs/stable/generated/torch.nn.Softmax.html) on the Vulkan GPU backend. The test inputs of the 1,2,3 dimension cases are simply the truncated existing 4 dimension inputs. The existing shader algorithms are reused.
Test Plan:
1. `buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1` on Apple M1 MacBook
2. Confirm all tests pass with no regression, and the added tests `*softmax*` pass under `-- --gtest_filter="*softmax*"`
2a. All tests P782531732
2b. `softmax` tests P782529114
```
~/fbsource » buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 -- --gtest_filter="*softmax*"
Buck UI: https://www.internalfb.com/buck2/692eb82d-c2ee-49bb-833f-3c11d6e2fea9
Jobs completed: 4. Time elapsed: 0.1s.
Running main() from xplat/third-party/gmock/googletest-1.12.1/googletest/src/gtest_main.cc
Note: Google Test filter = *softmax*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from VulkanAPITest
[ RUN ] VulkanAPITest.softmax
[ OK ] VulkanAPITest.softmax (42 ms)
[ DISABLED ] VulkanAPITest.DISABLED_log_softmax
[----------] 1 test from VulkanAPITest (42 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (42 ms total)
[ PASSED ] 1 test.
YOU HAVE 1 DISABLED TEST
```
Reviewed By: SS-JIA
Differential Revision: D46985319
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105012
Approved by: https://github.com/SS-JIA
Content same as #103948
@svekars the PR content is updated per your comment, but when trying to solve the conflict the original PR was closed by mistake. Would you help handle this new one? Sorry for the inconvenience.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105051
Approved by: https://github.com/svekars
Introduce `ReductionTypePromotionRule` and rename `TypePromotionRule` as
`ElementwiseTypePromotionRule`. Created base abstract class `TypePromotionRule`.
Reduction rules are manually curated because the total number of ops is low, yet
most of them require some special treatment. The ops covered in our unittest are:
- "all", done
- "amax", done
- "amin", done
- "any", done
- "cumsum", done
- "cumprod", no torchlib impl
- "mean", done
- "std", no torchlib impl
- "std_mean", no torchlib impl
- "sum", done
- "sum_to_size", no torchlib impl
- "prod", no torchlib impl
- "var", no torchlib impl
- "var_mean", tricky. Node has multiple outputs. Follow up in separate PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104491
Approved by: https://github.com/justinchuby, https://github.com/thiagocrepaldi
### <samp>🤖 Generated by Copilot at 52dac58</samp>
Add support for `torch.linalg.cholesky_ex` function that returns the Cholesky factorization and an error indicator. Refactor existing `torch.cholesky` and `torch.linalg.cholesky` to use the new function internally. Update tests and documentation accordingly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104751
Approved by: https://github.com/albanD
This prefigures a refactor that will move the backward compilation
to entirely ahead of time, so I need to extract these strides some
other way. Straight from the compiler's mouth will do it.
I can't easily get the information via the return result of `fw_compiler` without changing the calling convention, so instead I smuggle it via TracingContext. TracingContext may be None when we are compiling patterns for the joint graph pattern matcher.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105010
Approved by: https://github.com/shunting314
Not sure how it worked before, but arguments must be annotated as Optional if they default to None.
Towards enabling mypy-1.4.1 in lintrunner
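For example:
```python
from typing import Optional

# PEP 484: a parameter that defaults to None must be annotated as Optional.
def resize(shape: Optional[tuple] = None) -> None:  # not `shape: tuple = None`
    ...
```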
### <samp>🤖 Generated by Copilot at 5e1b9f4</samp>
> _We annotate the arguments of doom_
> _To show the `None` values of gloom_
> _We improve the type checking and readability_
> _With `Optional` annotations of metal-ity_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105022
Approved by: https://github.com/izaitsevfb, https://github.com/huydhn, https://github.com/Skylion007
The one sort of tricksy thing about this PR is that `num_symints_saved_for_bw` is populated later; we compute the metadata with a forward pass, but we only know `num_symints_saved_for_bw` once we run partitioning. This seems... fine.
Also, by pushing the conditionals into the slices, I can remove the top level if...else branch, for a nice simplification.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105009
Approved by: https://github.com/albanD
Summary:
This usage is not ideal:
subprocess.check_output(cmd, stderr=subprocess.STDOUT)
* `check_output` will capture the command's stdout, and here we did not return it
* not ideal to redirect the sub-command's stderr to the host process's stdout (with `check_call`, stdout stays stdout, stderr stays stderr).
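For reference, a minimal example of the preferred pattern:
```python
import subprocess

# check_call leaves the child's stdout as stdout and stderr as stderr,
# and raises CalledProcessError on a non-zero exit code.
subprocess.check_call(["echo", "hello"])
```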
Differential Revision: D47275261
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104743
Approved by: https://github.com/frank-wei
Enables additional inductor UTs on ROCm and un-skips outdated skips.
I have also removed a group of failures in `test_torchinductor_opinfo` which are now passing for CUDA and ROCm
```
- # The following 3 tests fail on CUDA with AssertionError: expected size 5==5, stride 5==1 at dim=0
- # linalg._svd's return value has different strides on CUDA vs CPU which causes this
- # In test_meta.py there is a mechanism to skipping strides checks for some ops
- # (including _linalg_svd), possibly we should have something similar here
- "linalg.cond": {f32, f64},
- "linalg.svdvals": {f32, f64},
- "linalg.matrix_rank": {f32, f64},
- "linalg.svd": {f32, f64},
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104624
Approved by: https://github.com/malfet
This makes it easier to exclude multi-line messages using single line
grepping. If your screen is wide enough this should not be a big
problem.
Example of what it looks like:
```
[2023-07-10 20:11:30,529] torch._dynamo.convert_frame.__guards: [DEBUG] GUARDS:
[2023-07-10 20:11:30,529] torch._dynamo.convert_frame.__guards: [DEBUG] hasattr(L['x'], '_dynamo_dynamic_indices') == False
[2023-07-10 20:11:30,529] torch._dynamo.convert_frame.__guards: [DEBUG] ___is_grad_enabled()
[2023-07-10 20:11:30,529] torch._dynamo.convert_frame.__guards: [DEBUG] not ___are_deterministic_algorithms_enabled()
[2023-07-10 20:11:30,529] torch._dynamo.convert_frame.__guards: [DEBUG] utils_device.CURRENT_DEVICE == None
```
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104932
Approved by: https://github.com/mlazos, https://github.com/albanD
Currently, negative unspecified ints get specialized. This PR creates symbolic values for
unspecified ints (including negative ones).
For example, with this PR, the following code only compiles once, instead of 3 times:
```python
def foo(x, y):
    return torch.fill(torch.zeros(x.shape), y)

x = torch.randn(3)  # any tensor input; only the int argument varies below
foo(x, 10)
foo(x, -5)
foo(x, -3)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104658
Approved by: https://github.com/ezyang
Summary:
QAT convert for mobilenetv2 was previously not working
because we incorrectly applied dropout during eval as well as
training. This is because, for exported models, model.eval() does
not change the behavior of dropout, unlike models with torch ops.
This commit simulates the effects of model.eval() for exported
models as well by replacing the aten dropout pattern before eval.
As of this commit, end-to-end QAT numerics now match for
mobilenetv2 between FX and PT2.
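For reference, the eager-mode behavior being simulated:
```python
import torch

drop = torch.nn.Dropout(p=0.5)
drop.eval()  # in eager mode this turns dropout into the identity
x = torch.ones(4)
print(torch.equal(drop(x), x))  # True
```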
Test Plan: python test/test_quantization.py TestQuantizePT2EModels.test_qat_mobilenet_v2
Differential Revision: D46750343
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104110
Approved by: https://github.com/jerryzh168
This PR disables translation validation (TV) when running the benchmark suites on
performance workflows: inductor with A100s.
In summary, the changes are:
- Add flag for turning TV on and off on _benchmarks/dynamo/common.py_
- Turn TV on only on CI accuracy builds
- Add `--no-translation-validation` target flag to _.ci/pytorch/test.sh_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104887
Approved by: https://github.com/ezyang
Context
-------
This PR adds a new fallback to the Autograd dispatch keys.
If you would prefer the old behavior:
- A quick (unsupported) way to get the previous behavior is to call
`torch._C._set_autograd_fallback("nothing")`
- Register "torch::CppFunction::makeFallthrough()" to your Autograd key,
like in https://gist.github.com/zou3519/d09a5f4b1afe2430af09fea67c6ff2c8
It is possible that this PR regresses performance of overhead-bound
models. If this is the case, please reach out (and apply one of the
temporary fixes in the previous section).
Description for reviewers
-------------------------
In order to deprecate registering autograd kernels at keys that are not Autograd
keys, we add a fallback to the Autograd dispatch keys. This fallback
raises a warning if the user attempts to backprop through the operator,
and it is configurable to either warn or not warn.
The goal of this PR is to
- preserve as much BC as possible
- raise a warning that whatever the user is doing is potentially wrong.
- be as performant as possible
There are roughly two cases:
- if the post-autograd kernels return a Tensor that requires grad, then
we install an autograd hook that raises a warning. We are preserving BC
in that it is possible that the user has a torch::autograd::Function
registered to their CPU key.
- if the post-autograd kernels return Tensors that do not require grad,
then we make them require grad and install a WarnNotImplemented grad_fn
that warns in the backward pass. This is mildly BC-breaking (see next
section).
Test Plan:
- bunch of new tests
BC-Breaking Note
----------------
This PR adds a new fallback to the Autograd dispatch keys. It affects
custom operators that do not have a kernel registered to the Autograd
keys (e.g. AutogradCPU and AutogradCUDA).
If the previous behavior was that the custom operator would return
Tensors that do not require grad if the inputs do require grad, then
this PR changes it so that all floating-point and complex returns do
require grad. See the "Context" section above for how to get the old
behavior.
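A hedged illustration of the new behavior for a custom operator with no Autograd kernel (the namespace and op name below are made up for this example):
```python
import torch

lib = torch.library.Library("myns", "DEF")
lib.define("my_op(Tensor x) -> Tensor")
lib.impl("my_op", lambda x: x.clone(), "CPU")  # CPU kernel only, no Autograd kernel

x = torch.randn(3, requires_grad=True)
y = torch.ops.myns.my_op(x)
# With the new fallback, floating-point outputs like y now require grad, and
# y.sum().backward() warns that the backward for my_op is not implemented.
```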
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104481
Approved by: https://github.com/soulitzer
This reduces total number of imported modules by default from 1419 to 1322 according to
```
time python -c "import sys;before=len(sys.modules);import torch;after=len(sys.modules);print(f'torch-{torch.__version__} imported {after-before} modules')"
```
and slightly reduces import time, while having no effect on UX (i.e. the `torch.onnx` submodule is kept intact)
Suppress lint errors that appear after mypy accidentally starts listing more files, for more details see: https://github.com/pytorch/pytorch/issues/104940
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104843
Approved by: https://github.com/jansel, https://github.com/albanD
This is a bug discovered by https://github.com/pytorch/pytorch/pull/104810. Basically, when the PR body is empty, the GitHub API returns a None value, which is then passed into `parse_reenabled_issues`, causing it to fail.
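A minimal sketch of the guard this implies (not necessarily the exact code in `.github/scripts/filter_test_configs.py`; the regex is only illustrative):
```python
import re

def parse_reenabled_issues(pr_body):
    # An empty PR body comes back from the GitHub API as None; treat it as no issues.
    if not pr_body:
        return []
    return re.findall(r"#(\d+)", pr_body)
```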
### Testing
```
python3 .github/scripts/filter_test_configs.py \
--workflow "pull" \
--job-name "linux-focal-py3-clang7-android-ndk-r19c-gradle-custom-build-single-full-jit / filter," \
--test-matrix "{ include: [ { config: 'default', shard: 1, num_shards: 1, runner: 'linux.2xlarge' }, ]}" \
--pr-number "104810" \
--tag "" \
--event-name "pull_request" \
--schedule "" \
--branch ""
```
The command works correctly without failing now
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104914
Approved by: https://github.com/clee2000
## Problem
Trying to support numpy function calls in dynamo, with a numpy dtype as an argument.
For example:
```
def fn(x: int):
    return np.empty_like(x, dtype=np.float64)
```
## Solution
This currently doesn't work because `NumpyVariable` doesn't implement `as_proxy()`. The idea in `as_proxy()` for now is to convert `np.float64` and other `np.<dtype>` objects into the corresponding `torch.dtype` and then feed that into the matching `torch_np` method.
For the previous example, we convert `np.float64` to `torch.float64` in `as_proxy()` and then feed it into the `torch_np.empty_like()` method.
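A sketch of the kind of mapping `as_proxy()` relies on (the actual table in dynamo may differ):
```python
import numpy as np
import torch

# Illustrative numpy-dtype -> torch.dtype mapping used when proxying calls
# like np.empty_like(x, dtype=np.float64) through torch_np.
NP_TO_TORCH_DTYPE = {
    np.float32: torch.float32,
    np.float64: torch.float64,
    np.int32: torch.int32,
    np.int64: torch.int64,
    np.bool_: torch.bool,
}
```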
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103546
Approved by: https://github.com/ezyang
## Context prior to this PR
https://github.com/pytorch/pytorch/pull/100017/ was merged onto PyTorch `main` branch with the goal of enabling `torch._dynamo.export` to perform symbolic tracing.
In that context, symbolic tracing is defined as tracing of a model using fake inputs and weights. An input is Fake when `torch.nn.Tensor` is replaced by `torch._subclasses.FakeTensor`, whereas a weight is fake when a `torch.nn.Parameter` is replaced by `torch._subclasses.FakeTensor`.
For additional context, several strategies were discussed with Meta to enable this feature, including 1) calling `torch._dynamo.export` within a `torch._subclass.FakeTensorMode` context and 2) **fake**fying the input and model as a separate step and then calling `torch._dynamo.export` without an active `torch._subclass.FakeTensorMode` context. In the end, 2) was preferred and implemented by #100017 to minimize the number of side-effects the fake tensor mode has on the code base.
As a consequence, the `torch._dynamo.export` API introduced a new argument called `fake_mode`. When symbolic tracing is used, the user must pass in the `fake_mode` used to fakefy both the input and the model. Internally, `torch._dynamo.export` will adopt this `fake_mode` instead of creating its own instance. This is needed because each instance of `FakeTensorMode` has metadata on the tensor/parameter it fakefied. Thus, using a real tensor/model and specifying a `fake_mode` to `torch._dynamo.export` is an error. Also, specifying a `fake_mode` instance to `torch._dynamo.export` different from the one used to fakefy the model and input is an error.
## Changes introduced from this PR
This PR is intended to integrate `torch._dynamo.export(fake_mode=...)` through `torch.onnx.dynamo_export`. In essence, it
* Introduces a new public API `ONNXFakeContext` which wraps a `FakeTensorMode` under the hood. This removes complexity from the user side while still allowing the exporter to leverage the fake mode.
* Adds a new public API `enable_fake_mode` *context manager* that instantiates and returns an `ONNXFakeContext`.
* Adds a new `ExportOptions.fake_context` that will be used to persist the `ONNXFakeContext` created by `enable_fake_mode` and plumb through until it reaches the call to `torch._dynamo.export`.
* Adds a `model_state_dict` argument to `ExportOutput.save` API.
* When model is exported with fake tensors, no actual data exist in the FX module and, therefore, in the ONNX graph.
* In fact, `torch.fx.make_fx` lifts initializers as model input when fake tensors are used
* https://github.com/pytorch/pytorch/pull/104493 is needed to enforce name matching between Parameters and inputs
* A model checkpoint file or state_dict is needed to populate the ONNX graph with real initializers through `export_output.save(model_state_dict=...)` API
Symbolic tracing, or onnx fake mode, is only enabled when the user instantiates the input and model within the `enable_fake_mode` context. Otherwise, real tracing is done, which preserves the current behavior.
## Usability
Because symbolic tracing depends a lot on having changes made on Dynamo side before it can be consumed on ONNX exporter, this feature may have its API and assumptions changed as symbolic tracing matures upstream. Nonetheless, it is still important to have this feature merged ASAP on the ONNX exporter side to "lock" changes on Dynamo that would otherwise break ONNX exporter without warning.
Example:
```python
class Model(torch.nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.linear = torch.nn.Linear(2, 2)

    def forward(self, x):
        out = self.linear(x)
        return out

with torch.onnx.enable_fake_mode() as fake_context:
    x = torch.rand(5, 2, 2)
    model = Model()

# Export the model with fake inputs and parameters
export_options = ExportOptions(fake_context=fake_context)
export_output = torch.onnx.dynamo_export(
    model, x, export_options=export_options
)

model_state_dict = Model().state_dict()  # optional
export_output.save("/path/to/model.onnx", model_state_dict=model_state_dict)
```
## Next steps
* Add unit tests running the exported model with ORT
Today this is not possible yet because `make_fx` used by our Decomposition pass lifts initializers as model inputs. However, the initializer names are not preserved by FX tracing, causing a mismatch between the initializer and input name.
https://github.com/pytorch/pytorch/pull/104493 and https://github.com/pytorch/pytorch/pull/104741 should fix the initializer mismatch, enabling model execution
* Revisit `ONNXTorchPatcher` and how the ONNX initializers are saved in the graph as external data
We can try to get rid of the PyTorch patcher. If we can't, we might prefer to create specific patchers, say an `FXSymbolicTracePatcher` used specifically during an export using `torch.fx.symbolic_trace` and maybe an `ExportOutputSavePatcher` used specifically for `ExportOutput.save`, to prevent patching too many PyTorch APIs that we don't need.
## References
* [FakeTensor implementation](https://github.com/pytorch/pytorch/blob/main/torch/_subclasses/fake_tensor.py)
* [PR that adds fake tensor support to torch._dynamo.export](https://github.com/pytorch/pytorch/pull/100017)
* [Short fake tensor documentation](https://pytorch.org/torchdistx/latest/fake_tensor.html)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103865
Approved by: https://github.com/BowenBao
For quantize, the AVX512 loop is:
```
for (; i < len / VLEN * VLEN; i += VLEN) {
__m512 x_vals = _mm512_load_ps(src + i);
__m512 x_transformed_v = _mm512_mul_ps(x_vals, inverse_scale_v);
x_transformed_v =
_mm512_min_ps(x_transformed_v, _mm512_set1_ps(int32_float_max_val));
__m512i x_rounded_v = _mm512_cvtps_epi32(x_transformed_v);
x_rounded_v = _mm512_add_epi32(x_rounded_v, _mm512_set1_epi32(zero_point));
__m512i x_clipped_v =
_mm512_max_epi32(min_v, _mm512_min_epi32(max_v, x_rounded_v));
x_clipped_v = _mm512_shuffle_epi8(x_clipped_v, shuffle_mask_v);
x_clipped_v = _mm512_permutexvar_epi32(permute_mask_l8_v, x_clipped_v);
_mm_storeu_si128(
reinterpret_cast<__m128i*>(dst + i),
_mm512_castsi512_si128(x_clipped_v));
}
```
```
x_clipped_v = _mm512_shuffle_epi8(x_clipped_v, shuffle_mask_v);
x_clipped_v = _mm512_permutexvar_epi32(permute_mask_l8_v, x_clipped_v);
```
aims to cast `int32` to `int8` and shuffle the 16 `int8` values into the first 128 bits.
For example, where `A1` represents 8 bits:
```
x_clipped_v = _mm512_shuffle_epi8(x_clipped_v, shuffle_mask_v);
A1A2A3**A4** B1B2B3**B4** C1C2C3**C4** D1D2D3**D4** -> D4C4B4A4 other 32 * 3 bit
E1E2E3**E4** F1F2F3**F4** G1G2G3**G4** H1H2H3**H4** -> H4G4F4E4 other 32 * 3 bit
I1I2I3**I4** J1J2J3**J4** K1K2K3**K4** L1L2L3**L4** -> L4K4J4I4 other 32 * 3 bit
M1M2M3**M4** N1N2N3**N4** O1O2O3**O4** P1P2P3**P4** -> P4O4N4M4 other 32 * 3 bit
x_clipped_v = _mm512_permutexvar_epi32(permute_mask_l8_v, x_clipped_v);
D4C4B4A4 other 32 * 3 bit -> D4C4B4A4 H4G4F4E4 L4K4J4I4 P4O4N4M4
H4G4F4E4 other 32 * 3 bit other 3 * 4 * 32 bits
L4K4J4I4 other 32 * 3 bit
P4O4N4M4 other 32 * 3 bit
```
Based on https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm512_permutexvar_epi32&ig_expand=4966,5088.
```
FOR j := 0 to 15
i := j*32
id := idx[i+3:i]*32
dst[i+31:i] := a[id+31:id]
ENDFOR
dst[MAX:512] := 0
```
the `permute_mask_l8_v` should satisfy
```
permute_mask_l8_v[3:0] = 0
permute_mask_l8_v[3 + 32:0 + 32] = 4
permute_mask_l8_v[3 + 64:0 + 64] = 8
permute_mask_l8_v[3 + 96:0 + 96] = 12
```
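A small Python model of `_mm512_permutexvar_epi32` (following the pseudo-code above) that checks this constraint; purely illustrative:
```python
def permutexvar_epi32(idx, a):
    # idx and a model sixteen 32-bit lanes; only the low 4 bits of each idx lane are used.
    return [a[idx[j] & 0xF] for j in range(16)]

# Lanes 0..3 must select dwords 0, 4, 8 and 12 (the low dword of each 128-bit lane),
# which moves the four packed int8 groups into the first 128 bits.
mask = [0, 4, 8, 12] + [0] * 12
src = list(range(16))
assert permutexvar_epi32(mask, src)[:4] == [0, 4, 8, 12]
```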
The other parts of `permute_mask_l8_v` do not matter.
The `AVX2` version is correct.
This bug was not exposed before because the function is only called with a fixed length of `64`: https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cpu/vec/vec512/vec512_qint.h#L545-L546.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104400
Approved by: https://github.com/jgong5, https://github.com/mingfeima, https://github.com/jerryzh168
Current TCPStore wait logic leaves the client socket in a bad state if waiting times out.
This happens because all recv functions raise an exception on timeout and that's it.
The problem is that on timeout we need to unregister the wait.
We implement this with client side cancelation by adding a new CANCEL_WAIT instruction.
So, if no data arrives before the deadline, the client sends a CANCEL_WAIT command.
The server sends a WAIT_CANCELED response to that command, always.
This gets us down to the last issue, which is that there's a race between timing out,
canceling the wait, and the wait completing. The client needs to handle the server sending
a STOP_WAITING followed by a WAIT_CANCELED answer.
This ensures client and server state are synchronized regardless of whether the wait
times out or not.
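A toy model of that handshake, with string messages on in-memory queues instead of bytes on a socket (names are illustrative, not the real TCPStore API):
```python
from queue import Queue, Empty

def client_wait(incoming: Queue, outgoing: Queue, timeout: float) -> bool:
    # Returns True if the wait completed, False if it was canceled after a timeout.
    try:
        if incoming.get(timeout=timeout) == "STOP_WAITING":
            return True
    except Empty:
        pass
    # Deadline passed without data: ask the server to cancel the wait.
    outgoing.put("CANCEL_WAIT")
    # The wait may have completed concurrently, so tolerate a STOP_WAITING
    # arriving before the WAIT_CANCELED acknowledgement.
    msg = incoming.get()
    if msg == "STOP_WAITING":
        msg = incoming.get()
    assert msg == "WAIT_CANCELED"
    return False
```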
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100594
Approved by: https://github.com/H-Huang
Python's `mod` semantics are not the same as the mathematical modulus operation. According to
the Python reference: `a = floor(a / b) * b + a % b`.
In other words: `a % b = a - floor(a / b) * b`.
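A quick check of these semantics (note the result takes the sign of the divisor, unlike the SMT-LIB2 definition):
```python
import math

for a, b in [(-7, 3), (7, -3), (-7.5, 2.0)]:
    assert a % b == a - math.floor(a / b) * b

print(-7 % 3, 7 % -3)  # 2 -2
```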
This PR fixes the old implementation which used SMT-LIB2 semantics for `mod`. In short, it
only worked with integers and had the following guarantee: `0 <= a % b < b`.
In summary, the changes are:
- `a % b = a - floordiv(a, b) * b`
- `a` and `b` can be both integer or real
- The result will be real if any of the arguments is real. Otherwise, it will be integer
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104827
Approved by: https://github.com/lezcano
Originally, we didn't enable BWD for colwise embedding because we thought it was just for inference, but it turns out that we do need it for training. So, let's enable it for now; a unit test is also added.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104820
Approved by: https://github.com/fegin
MS2 of the Reproducible Testing BE initiative. For context, this is the ask:
```
Another thing that would be really great as we start to have more dependent
systems or types of tests (functorch, dynamo, crossref) would be to have a
minimally reproducible version of the test (something at the end of the HUD
comment like: "Run python test/test_file.py -k test_name" but also if you need
flags, like crossref it would be like "Run <flag to run crossref> python test/..." ). I'll
often go through the test infra to find the flags that I need to pass when
something only breaks crossref/dynamo tests.
```
Implementation details:
* Adds a new flag `PRINT_REPRO_ON_FAILURE` that is settable through the environment variable `PYTORCH_PRINT_REPRO_ON_FAILURE=1`
* **Default is ON but I can be persuaded otherwise**
* When the flag is enabled, our base `TestCase` will wrap the test method in a context manager that catches any non-skip exceptions and appends a repro string to the exception message. The repro includes setting of necessary test flags through env vars. Example:
```
To execute this test, run the following from the base repo dir:
PYTORCH_TEST_WITH_CROSSREF=1 python test/test_ops.py -k test_foo_add_cuda_float32
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
```
* To keep track of flag settings, this PR introduces a new `TestEnvironment` class that defines global flags by querying related environment variables. Flag and env var names are purposefully kept searchable via full names. Example usages:
```python
TestEnvironment.def_flag("TEST_WITH_TORCHINDUCTOR", env_var="PYTORCH_TEST_WITH_INDUCTOR")
# can track implication relationships to avoid adding unnecessary flags to the repro
TestEnvironment.def_flag(
"TEST_WITH_TORCHDYNAMO",
env_var="PYTORCH_TEST_WITH_DYNAMO",
implied_by_fn=lambda: TEST_WITH_TORCHINDUCTOR or TEST_WITH_AOT_EAGER)
# can use include_in_repro=False to keep the flag from appearing in the repro command
TestEnvironment.def_flag(
"DISABLE_RUNNING_SCRIPT_CHK", env_var="PYTORCH_DISABLE_RUNNING_SCRIPT_CHK", include_in_repro=False)
# the default default value is False, but this can be changed
TestEnvironment.def_flag(
"PRINT_REPRO_ON_FAILURE", env_var="PYTORCH_PRINT_REPRO_ON_FAILURE", default=(not IS_FBCODE), include_in_repro=False)
```
* AFAICT it is only feasible to achieve this from within the test framework rather than at the CI level. This is because CI / `run_test.py` are unaware of individual test cases. Implementing it in our base `TestCase` class has the broadest area of effect, as it's not isolated to e.g. OpInfo tests.
* I couldn't find an easy way to test the logic via `test_testing.py`, as the logic for extracting the test filename doesn't work for generated test classes. I'm open to ideas on testing this, however.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104537
Approved by: https://github.com/ezyang, https://github.com/janeyx99, https://github.com/huydhn
Fixes #101684
Before this change, we get a float constant in triton
```
tmp0 = 0.2
```
which in triton IR becomes a float32 value
```
%cst_0 = arith.constant dense<2.000000e-01> : tensor<2xf32>
```
After, we get a tensor with explicit type
```
tmp0 = tl.full([1], 0.2, tl.float64)
```
which does generate a float64 in the triton IR
```
%cst_0 = arith.constant dense<2.000000e-01> : tensor<2xf64>
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104830
Approved by: https://github.com/lezcano
Summary:
This diff does the following:
1. re-enable single_file_per_rank for FsspecWriter, as the file slicing error is resolved by https://github.com/pytorch/pytorch/pull/99167
2. remove sync_files from FsspecWriter as there is no fsspec equivalence.
3. remove the internal implementation of FsspecWriter/Reader, as it has been upstreamed to PyTorch OSS
4. keep the internal test for manifold inside internal as we can only test it in fb environment
5. consolidate test to remove duplicates
6. remove unnecessary TARGETS
Test Plan:
```
buck test @//mode/dev-nosan //caffe2/test/distributed/checkpoint/fb:test_fsspec_filesystem -- --print-passing-details
----------------------------------------------------------------------
Ran 1 test in 54.894s
OK
/usr/local/fbcode/platform010/lib/python3.8/tempfile.py:818: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmpzomokvh6'>
_warnings.warn(warn_message, ResourceWarning)
Buck UI: https://www.internalfb.com/buck2/4cb722a2-3ee7-48f2-a9ef-55ee6fb1a498
Test UI: https://www.internalfb.com/intern/testinfra/testrun/8725724447995201
Network: Up: 8.8 MiB Down: 1.5 GiB (reSessionID-04c29f56-ae94-4187-8a1a-c812f432674d)
Jobs completed: 209847. Time elapsed: 1:56.5s.
Cache hits: 100%. Commands: 85687 (cached: 85687, remote: 0, local: 0)
Tests finished: Pass 3. Fail 0. Fatal 0. Skip 0. Build failure 0
```
Differential Revision: D47266068
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104724
Approved by: https://github.com/fegin, https://github.com/fduwjj
Starts addressing https://github.com/pytorch/pytorch/issues/97712 by
- Minimizing intermediates usage for foreach Adam
- Document the extra memory usage
- Add comments within the code for clarity now that we reuse intermediates
- Add tests
- Did some refactoring
Next steps involve doing this for all other foreach implementations. Note that even after this change, foreach mem usage will be higher than forloop due to the fact that we have a minimum budget of 1 intermediate (to not muddle the input values) and the intermediate will be larger. For capturable, the memory usage is higher due to moving more tensors to CUDA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104780
Approved by: https://github.com/albanD
Summary:
Add a new path in `post_grad.py` for replacing addmm + ReLU / GELU activation with the corresponding `_addmm_activation` call (with `use_gelu=False` or `True`, respectively). The replacement is done only when `max_autotune_gemm=False` and the activation is fusible.
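Functionally, the rewrite maps the fused pattern onto a single kernel call; a hypothetical eager-level statement of the ReLU pattern and its replacement (not the actual pattern-matcher registration):
```python
import torch

def pattern_relu(bias, mat1, mat2):
    return torch.relu(torch.addmm(bias, mat1, mat2))

def replacement_relu(bias, mat1, mat2):
    return torch.ops.aten._addmm_activation(bias, mat1, mat2, use_gelu=False)
```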
Test Plan:
$ python test/inductor/test_pattern_matcher.py -k test_addmm_activation -v
(__main__.TestPaternMatcher.test_addmm_activation) ... /data/users/aakhundov/pytorch/torch/_inductor/compile_fx.py:128: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
warnings.warn(
Using FallbackKernel: aten._addmm_activation.default
Using FallbackKernel: aten._addmm_activation.default
/data/users/aakhundov/pytorch/torch/_dynamo/eval_frame.py:373: UserWarning: changing options to `torch.compile()` may require calling `torch._dynamo.reset()` to take effect
warnings.warn(
frames [('total', 1), ('ok', 1)]
stats [('calls_captured', 2), ('unique_graphs', 1)]
aot_autograd [('total', 1), ('ok', 1)]
inductor []
ok
----------------------------------------------------------------------
Ran 1 test in 13.415s
OK
Reviewers: @eellison
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104132
Approved by: https://github.com/eellison, https://github.com/jansel
When using KeyedOptimizer.init_state(), some optimizers initialize the states even if the param is empty (size() == 0) while other optimizers avoid initializing the states. There is no way FSDP can tell which is the case. Instead, FSDP should look up `optim.state`. Fortunately, `optim.state` does not rely on FQNs, which some internal users change.
Differential Revision: [D47285562](https://our.internmc.facebook.com/intern/diff/D47285562/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104765
Approved by: https://github.com/fduwjj
Per https://github.com/pytorch/pytorch/pull/103303 we cannot
universally allow tracing in all functions that return int,
as the graph breaks appear to be load bearing in some cases.
However, allowing torch.sym_int to be traced even if
the result is statically known is fine; this can happen in
the case of a SymBool to int conversion.
This PR is not exhaustive but e.g., I fixed size/stride/numel
handling in https://github.com/pytorch/pytorch/pull/103438
The biggest risk is that arithmetic operations on sizes end
up getting constant-ified (this appears to have happened in
practice for modulus, which is why it's in this list.) If
we don't care about spewing useless computation into the graph,
a more aggressive version of this PR would be to greatly expand
the list of allowed to specialize to int targets and then undo
https://github.com/pytorch/pytorch/pull/103438
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104837
Approved by: https://github.com/voznesenskym
Previously, if you called `torch.fx.wrap()` on the same thing twice, it would add two entries to `_wrapped_fns_to_patch`. Then, when tracing, the patcher would process them both. On the second entry, the patcher would double-wrap the fn (e.g. `wrap(wrap(orig_fn))`)
This double wrapping makes the wrapping observable after the trace: while a Patcher instance normally "reverts" the wrapping after tracing, the double-wrapped function only goes from `wrap(wrap(orig_fn))` to `wrap(orig_fn)`.
This happens to work in normal fx stuff (after all, the wrapper function will behave exactly like the original function). But it upsets torch.package, which is not expecting to see a weird wrapper function in the graph.
This PR adds a dictionary to deduplicate `wrap()` calls, ensuring that the patcher only operates once per frame-fn pair.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104838
Approved by: https://github.com/Chillee
Summary:
AOTInductor model wrapper code has been moved to torch/_inductor so
that we can remove the duplicates from deeplearning, which were
placed there temporarily.
This PR also made the following changes to inductor codecache to make it work with AOTInductor:
* take the full input and output paths in aot_mode
* use a more suitable way to retrieve dirname from the input_path
Differential Revision: D47118805
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104730
Approved by: https://github.com/jansel
This is intended as a first step towards reductions with multiple outputs. This
also incidentally improves CSE of reductions under C++ codegen. For example,
```python
def fn(x):
    return torch.argmin(x, dim=-1), torch.argmin(x, dim=-1)
```
Currently this generates two reductions, where the common load is CSEd
```cpp
for(long i1=static_cast<long>(0L); i1<static_cast<long>(10); i1+=static_cast<long>(1L))
{
auto tmp0 = in_ptr0[static_cast<long>(i1 + (10L*i0))];
if (tmp_acc0.value > tmp0) {
tmp_acc0.index = i1; tmp_acc0.value = tmp0;
}
if (tmp_acc1.value > tmp0) {
tmp_acc1.index = i1; tmp_acc1.value = tmp0;
}
}
auto tmp1 = tmp_acc0.index;
out_ptr0[static_cast<long>(i0)] = tmp1;
auto tmp2 = tmp_acc1.index;
out_ptr1[static_cast<long>(i0)] = tmp2;
```
but with this change it gets CSEd to a single accumulator
```cpp
for(long i1=static_cast<long>(0L); i1<static_cast<long>(10L); i1+=static_cast<long>(1L))
{
auto tmp0 = in_ptr0[static_cast<long>(i1 + (10L*i0))];
if (tmp_acc0.value > tmp0) {
tmp_acc0.index = i1; tmp_acc0.value = tmp0;
}
}
auto tmp1 = tmp_acc0.index;
out_ptr0[static_cast<long>(i0)] = tmp1;
out_ptr1[static_cast<long>(i0)] = tmp1;
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102737
Approved by: https://github.com/jgong5, https://github.com/lezcano
This is a bit inefficient because it computes the mean and throws it
away since ir.Reduction nodes only have 1 output. However, the mean
can at least be scheduled into the same loop as the variance now since
there is no data dependency. Thus we can take fewer passes over the
data.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102486
Approved by: https://github.com/lezcano, https://github.com/jansel
As of now, translation validation runs to its completion. However, Z3 is time
consuming. PR #104464, for example, disables translation validation for a few benchmarks.
Instead, this PR introduces a timeout for translation validation. In that case, Z3 will
return `unknown`, since it wasn't able to prove or disprove the assertions. Then, we log
it as a warning, but don't stop execution.
Here's a summary of the changes:
- Added an environment variable for turning translation validation on and off
- Added an environment variable for setting the translation validation timeout
- Possibly reverts the changes in #104464
- ~~Move from "QF_NRA" to "QF_NIRA" logic~~
- ~~It makes more sense, given the nature of the problems~~
- "QF_NRA" seems to solve more instances of _dynamo/test_dynamic_shapes.py_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104654
Approved by: https://github.com/ezyang
Fixes https://github.com/pytorch/pytorch/issues/104272
This PR adds a new private API `materialize_non_diff_grads` (default True) such that when set to False, grad outputs corresponding to outputs marked non-differentiable receive None instead of a zero-filled tensor. This overrides the setting of `materialize_grads`, i.e. grad outputs corresponding to non-differentiable outputs would still be None even if `materialize_grads=True` (the default).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104291
Approved by: https://github.com/albanD
The "for now" is because we still have the issue that when using the parameter `ignored_states` path, we do not recover the ignored modules, so FSDP still wraps those as empty shells (no managed parameters), which is not ideal. This is not a blocking issue as far as I know.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104418
Approved by: https://github.com/rohan-varma
This moves `fully_shard` to use `_auto_wrap()` just like `FullyShardedDataParallel`. This means that `fully_shard` goes through the `_init_param_handle_from_module()` path (i.e. 1 `fully_shard` per "wrap"), removing the need for `_init_param_handles_from_module()` (which was 1 `fully_shard` for all "wraps" of a given policy). `_auto_wrap()` simply calls `fully_shard` on target submodules.
This includes several important fixes:
- We should register the pre/post-forward hooks on the module regardless of it has managed parameters.
- We can permit `_module_handles` to return `[]` in the composable path (for when the module has no managed parameters).
- We should unify the paths for `_get_buffers_and_dtypes_for_computation()` (previously, composable path was buggy in some cases).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104408
Approved by: https://github.com/rohan-varma
This PR is the first in refactoring the auto wrapping, only affecting `ModuleWrapPolicy` for wrapper `FullyShardedDataParallel`. The end goal is to improve the auto wrapping infra to support:
- Checking valid frozen parameters (uniform frozenness per FSDP)
- Checking valid shared parameters (shared parameters assigned to their lowest-common-ancestor module or higher)
- Writing auto wrapping policies that may take multiple passes over the module tree
- Specifying different FSDP kwargs per FSDP instance (instead of enforcing the same for all FSDP instances constructed via an auto wrap policy)
The way I envision achieving this is that we decouple the actual "wrapping" (which is `_post_order_apply()` in this PR) from constructing the wrapping targets and kwargs (which is `target_module_to_kwargs` in this PR). In that way, a policy reduces to just constructing that latter `target_module_to_kwargs` mapping.
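A minimal sketch of that decoupling (names are illustrative, not the internal FSDP helpers): a policy builds the module-to-kwargs mapping, and a generic post-order pass performs the wrapping.
```python
import torch.nn as nn

def post_order_apply(root: nn.Module, target_module_to_kwargs: dict, wrap_fn) -> None:
    # Wrap children before parents so nested instances are constructed bottom-up.
    for name, child in root.named_children():
        post_order_apply(child, target_module_to_kwargs, wrap_fn)
        if child in target_module_to_kwargs:
            setattr(root, name, wrap_fn(child, **target_module_to_kwargs[child]))
```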
I do not personally recommend the size-based policy, but if we wanted to implement that under this new organization, the tracking of wrapped/nonwrapped numel should be done in the pass over the module tree prior to the actual "wrapping". This modularization keeps the actual "wrapping" part simple.
The change to how `old_dtype` is handled is mainly to avoid keeping a reference to the `_override_module_mixed_precision()` function closure in each hook and to allow the function to take in all module classes at once, so it can return which ones actually got overridden for the downstream error message. (We can directly store the global state as a mapping.)
To-do in follow-ups (not in order):
- Add frozen parameter check before `_post_order_apply()`
- Add shared parameter check before `_post_order_apply()`
- Expose wrapping policy that allows per module / per module class kwarg customization (where any unspecified kwarg adopts the root's kwarg)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104346
Approved by: https://github.com/rohan-varma, https://github.com/fegin
This fixes the failure when a list of skipped messages is encountered when uploading disabled test stats, for example https://github.com/pytorch/pytorch/actions/runs/5489936777/jobs/10004725533.
This happens for ONNX tests (running regularly), i.e. https://ossci-raw-job-status.s3.amazonaws.com/log/14868893973:
```
onnx/test_op_consistency.py::TestOnnxModelOutputConsistency_opset13CPU::test_output_match_tile_cpu_bool SUBSKIP [0.0000s] (Logic not implemented for size 0 inputs in op.Reshape) [ 47%]
onnx/test_op_consistency.py::TestOnnxModelOutputConsistency_opset13CPU::test_output_match_tile_cpu_bool SUBSKIP [0.0000s] (Logic not implemented for size 0 inputs in op.Reshape) [ 47%]
...
onnx/test_op_consistency.py::TestOnnxModelOutputConsistency_opset13CPU::test_output_match_tile_cpu_bool SUBSKIP [0.0000s] (Logic not implemented for size 0 inputs in op.Reshape) [ 47%]
onnx/test_op_consistency.py::TestOnnxModelOutputConsistency_opset13CPU::test_output_match_tile_cpu_bool PASSED [0.3136s] [ 47%]
```
The corresponding XML output is as follows https://paste.sh/b1DbSLJD#M-0WsXd9snjEVFh4ZsxPPIlv where `skipped` is a list of skipped messages instead of a dictionary.
As we only care about gathering disabled tests stats in this script, the list of skipped messages can be safely ignored.
### Testing
* Gathering disabled test stats works correctly when running under rerunning disabled tests mode https://github.com/pytorch/pytorch/actions/runs/5487829458/jobs/9999835911
* The command works locally for the above failed workflow (which is not a rerunning disabled tests workflow):
```
python3 -m tools.stats.check_disabled_tests --workflow-run-id "5488337480" --workflow-run-attempt 1 --repo "pytorch/pytorch"
...
The following 0 tests should be re-enabled:
The following 0 are still flaky:
Writing 0 documents to S3
Done!
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104803
Approved by: https://github.com/clee2000
This PR introduces a new pass that restores parameter and buffer names from the original module.
It is useful for readability of the exported ONNX graph. It restores the parameter and buffer names from the original module. For example, if the original module has a parameter named `root.linear.0.weight`, and the parameter is renamed to
`_param_constant9` by FX, this pass will rename it back.
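A rough sketch of the renaming idea (illustrative only; the real pass operates on the exported FX graph): recover the original fully qualified names by tensor identity and map the FX-generated names back to them.
```python
import torch

def build_rename_map(original: torch.nn.Module, fx_name_to_tensor: dict) -> dict:
    # Original FQN for each parameter/buffer, keyed by tensor identity.
    tensor_to_fqn = {id(t): fqn for fqn, t in original.named_parameters()}
    tensor_to_fqn.update({id(t): fqn for fqn, t in original.named_buffers()})
    # FX-generated name -> restored name (fall back to the FX name if unknown).
    return {name: tensor_to_fqn.get(id(t), name) for name, t in fx_name_to_tensor.items()}
```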
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104493
Approved by: https://github.com/wschin, https://github.com/thiagocrepaldi
## ONNXRegistry
### Motivation
In #100660, we used the torchscript registry to allow dispatcher. However, it doesn't meet the needs of FX exporter. The idea of torchscript exporter is built on top of three points:
(1) Use `_SymbolicFunctionGroup` to dispatch opset version as we need ops to fall back when we don't have it in the current exporter opset version
(2) One aten maps to multiple supported opset versions, and each version maps to one symbolic function
(3) Custom symbolic function is considered prior to default symbolic function
Now that TorchLib will support all aten ops across all opset versions, we don't need the opset version dispatch layer. And with onnx overloads created by torchlib, we need a way to support custom operators and prioritize them among all overloads.
### Feature
Introduce a public OnnxRegistry API, initialized with a fixed opset version, which supports user-registered operators. **The dispatching opset version is no longer needed, as TorchLib is expected to provide full aten support across all opset versions. And the Dispatcher is expected to prioritize custom operators over the defaults.**
### API:
(1) `register_custom_op(self, function: OnnxFunction, domain: str, op_name: str, overload: Optional[str] = None)`: Register a custom operator into the current OnnxRegistry. This is expected to be used when the default operators don't meet the needs of users. **For example, when a different opset version from the registry, or a different computation, is needed**.
(2) `is_registered_op(self, domain: str, op_name: str, overload: Optional[str] = None)`: Whether the aten op is registered.
(3) `get_functions(domain: str, op_name: str, overload: Optional[str] = None)`: Return the set of registered SymbolicFunctions under the given aten op.
### TODO:
(1) `remove_op(op_name: str)`: removing the whole support for a certain op allows decomposing the graph to prims.
(2) Expose OnnxRegistry to users, and disable the opset_version option in the export API. The export API should use the ops in the registry only.
---
## OnnxDispatcher
The changes in the `dispatch` and `_find_the_perfect_or_nearest_match_onnxfunction` functions are meant to allow complex type and custom operator support.
### Respect Custom Ops
(1) Override: Check if we can find the perfect match in custom operator overloads prior to defaults
(2) Tie breaker: If we have the same nearest match of default and custom overload, we choose the custom.
### Supplementary
[Design discussion doc](https://microsoft-my.sharepoint.com/:w:/p/thiagofc/EW-5Q3jWhFNMtQHHtPpJiAQB-P2qAcVRkYjfbmeSddnjWA?e=QUX9zG&wdOrigin=TEAMS-ELECTRON.p2p.bim&wdExp=TEAMS-TREATMENT&wdhostclicktime=1687554493295&web=1)
Please check the Registry and Dispatcher sections.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103943
Approved by: https://github.com/BowenBao, https://github.com/justinchuby
Background/problem: ops.bucketize needs to take a value `offsets_size`, which is the length of the `offsets` tensor. It is used, e.g., for the bounds of the binary search over the `offsets` tensor. The previous implementation of `ops.bucketize` expected `offsets_size` to be a CSEVariable; i.e. we'd pass `offsets_size = ops.index_expr(offsets.get_size()[0])` into `ops.bucketize()`. However, `ops.index_expr` will sometimes broadcast, turning the scalar `offsets_size` into a tensor. That caused errors, because [triton_helpers.bucketize_binary_search](a2fe6953bc/torch/_inductor/triton_helpers.py (L153-L155)) expects `offsets_size` to be a scalar. [Link - where the broadcasting happens](a2fe6953bc/torch/_inductor/codegen/triton.py (L1056))
Solution (this PR): Instead of passing `offsets_size` into `ops.bucketize` as a CSEVariable, pass in a sympy.Expr. Then, inside ops.bucketize, convert the sympy.Expr into a string that can be used in the generated triton code.
Differential Revision: [D47282413](https://our.internmc.facebook.com/intern/diff/D47282413)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104756
Approved by: https://github.com/jansel
The analysis for SymPy expressions was incorrect: even though it said
that the assumption was "smoothness", the assumption was, in fact, that the
formula was monotone in every variable. In other words, it was
assuming that the derivative does not change sign in any variable (!!).
We implement a function that, given bounds on the values of the free
symbols of a sympy expression, gives a bound on the expression
itself.
We reshuffle a few things in value_ranges.py to create a
`SymPyValueRangeAnalysis` class, but we do not really change any code.
The only relevant change in that file is the addition of the
`sympy_bounds` function. We do this because we don't want to inadvertently
use any fallbacks in this case.
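A toy example of this kind of bound propagation, for a product of bounded variables (illustrative only, not the torch implementation):
```python
def bound_mul(lo1, hi1, lo2, hi2):
    # The extrema of x*y over a box are attained at the corners.
    corners = [lo1 * lo2, lo1 * hi2, hi1 * lo2, hi1 * hi2]
    return min(corners), max(corners)

# x in [-2, 3], y in [1, 4]  ->  x*y in [-8, 12]
print(bound_mul(-2, 3, 1, 4))
```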
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104559
Approved by: https://github.com/eellison
This PR:
- It adds a few boolean variants of some methods that were missing
- It simplifies the implementation of plenty of the operations
- Adds ModularIndexing to the SymPy interpreter
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104557
Approved by: https://github.com/eellison
Summary: include `allow_tf32` in system information; previously aten results did not specify whether `allow_tf32` was true or not
Test Plan: sandcastle + CI
Differential Revision: D46568468
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104129
Approved by: https://github.com/jansel
Previous to this PR, to support onnxscript function protos in the torchscript exporter, every registered custom symbolic function was forced through the `.function_proto` API as if it were an onnxscript function. This PR makes sure the custom function actually is an onnxscript function before using that API; to avoid a hard dependency on onnxscript, `hasattr` is used for the check instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104785
Approved by: https://github.com/BowenBao
In the binary search triton implementation (#104007), num_elements_per_warp=32 performs a lot better than larger values.
This PR adds an autotuning config option for this purpose. But since autotuning can affect compile times and this config isn't generally useful, we only try this config if bucketize is present. This is done by adding an extra field to triton_meta which is used by the pointwise autotuning.
Performance: reused https://gist.github.com/davidberard98/066fd2115f59f5889ef61e4527d1eba5.
Before:
```
Eager 0.30088499188423157 ms
PT2 0.9296960234642029 ms
```
After:
```
Eager 0.3011910021305084 ms
PT2 0.22977299988269806 ms
```
Differential Revision: [D47237103](https://our.internmc.facebook.com/intern/diff/D47237103)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104456
Approved by: https://github.com/eellison
Previously this was defined as `_run_node_and_update_meta_val`, which selectively only
updates `meta["val"]`. The behavioral difference stems from two types of scenarios:
node creation and node modification. `node.meta` is empty for the former, while it already
exists and is populated for the latter. This PR updates the API to handle both scenarios.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104720
Approved by: https://github.com/thiagocrepaldi
Summary:
Some serialized nn_module_stacks contain nested commas, something like:
`(getitem(L['module'],0),torch.nn.modules.linear.Linear)`
Fix the parsing so that we can deserialize strings in the format of `(local identifier, module type)`.
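A sketch of comma splitting that respects nesting (illustrative; the actual fix lives in the serialization code): strip the outer parentheses, then split only at commas that sit at nesting depth zero.
```python
def split_module_stack_entry(entry: str) -> tuple:
    # entry looks like "(getitem(L['module'],0),torch.nn.modules.linear.Linear)"
    inner = entry[1:-1]
    parts, depth, cur = [], 0, []
    for ch in inner:
        if ch in "([":
            depth += 1
        elif ch in ")]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(cur))
            cur = []
        else:
            cur.append(ch)
    parts.append("".join(cur))
    local_identifier, module_type = parts
    return local_identifier, module_type

print(split_module_stack_entry("(getitem(L['module'],0),torch.nn.modules.linear.Linear)"))
```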
Test Plan: CI
Differential Revision: D47252881
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104721
Approved by: https://github.com/zhxchen17
Remove _deprecated_global_ns from cond following #104105.
We change the module attribute of HigherOrderOperator instances in the constructor from torch.ops to torch.ops.higher_order when self.namespace is "higher_order". For subclasses (e.g. customized higher order operator), we leave their `__module__` unchanged.
Will import this PR to fix internal tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104380
Approved by: https://github.com/zhxchen17, https://github.com/zou3519
The warning complains that `TORCH_CUDA_ARCH_LIST` is set on the environment
instead of being defined as a build variable, which is fixed by the change to
`tools/setup_helpers/cmake.py`.
However, I still see the warning even with this fix because
```cmake
if((NOT EXISTS ${TORCH_CUDA_ARCH_LIST}) ...
```
is actually checking whether a file exists called "7.5" (or whatever arch is
being requested). Instead we want to check if the variable is defined.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104680
Approved by: https://github.com/albanD
I think after https://github.com/pytorch/pytorch/pull/104077, we don't
need to do a diff between the SideEffects object before/after for
HigherOrderOps -- the ability is baked into speculate_subgraph.
The rationale for this PR is that diff-ing the SideEffects object didn't
work very well: it was overly conservative. If a variable gets
tracked for mutation, or a new cell variable is created, then the
SideEffects object changes.
The SideEffects object tracks two types of side effects:
- variable assignment/modification. This is covered by
[check_allowed_side_effect](https://github.com/pytorch/pytorch/blob/main/torch/_dynamo/side_effects.py#L146C9-L146C34)
- save_for_backward tracking. I don't think we even need to track this;
if the inputs require grad, then we cannot graph break in the middle of
autograd.Function, so we never need to replay calling `save_for_backward`.
If the inputs don't require grad, then `save_for_backward` doesn't do
anything, so it doesn't need to be replayed either. If we wanted to be
safe we could also call `check_allowed_side_effect` there.
Test Plan:
- #104077 introduced some heavy testing already. This PR adds some more
test cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104685
Approved by: https://github.com/ydwu4
Fixes #104298.
The bug was:
- we were only checking for freevars in SubgraphTracer.create_proxy
- freevars can also show up in SubgraphTracer.create_node
This PR adds handles free variable handling of the output of the graph
(which is created via `create_node`) in `speculate_subgraph`.
Because `create_proxy` calls `create_node`, you may be wondering why we
can't do the freevar lifting in `create_node`. The answer is that:
- `create_node` only gets used by Dynamo to create outputs of a graph,
so it is called rarely. All other callsites go through `create_proxy`.
- our freevar system is based off of VariableTrackers being associated with
Proxy objects which are associated with a SubgraphTracer.
- `create_proxy` accepts Proxy args while `create_node` accepts Node args
- Given a node, there isn't a way to retrieve the existing proxy that
wraps the node.
Test Plan:
- add new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104371
Approved by: https://github.com/ydwu4
Fixes #91338
Follow up from https://github.com/pytorch/pytorch/pull/91342
> 🚀 The feature, motivation and pitch
> We have an existing DeviceType class all over the place in our code base, and it conflicts with the one that is used in torch. > Thankfully the pytorch DeciceType enum class is under the c10 namespace.
```
In file included from /xxx/build/_deps/torch-src/../../aten/src/ATen/ops/view.h:5:
/xxx/_deps/torch-src/aten/src/ATen/Context.h:265:14: error: reference to 'DeviceType' is ambiguous
if (p == DeviceType::HIP) {
^
/xxx/include/Common_types.h:178:8: note: candidate found by name lookup is 'DeviceType'
struct DeviceType {
^
/xxx/build/_deps/torch-src/c10/../c10/core/DeviceType.h:32:12: note: candidate found by name lookup is 'c10::DeviceType'
enum class DeviceType : int8_t {
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104364
Approved by: https://github.com/albanD
In https://github.com/pytorch/pytorch/pull/97645 and some follow up diffs, we made FSDP run in full precision in eval mode, even if mixed precision was specified.
However, this is probably not the best idea, and we should give users a bit more control over this. This PR adds an env var FSDP_FULL_PREC_IN_EVAL, defaulting it to off; users who want to run eval in fp32 can toggle it before wrapping the model in FSDP:
```
os.environ["FSDP_FULL_PREC_IN_EVAL"] = "1"
```
Verified that unittests, APS workflow, TNT workloads can run eval appropriately with this change.
Differential Revision: [D47246556](https://our.internmc.facebook.com/intern/diff/D47246556/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104682
Approved by: https://github.com/awgu
Summary:
Without this diff we get
```
CUDA error (./fbcode/caffe2/aten/src/ATen/native/transformers/cuda/flash_attn/fmha_bwd_launch_template.h:113): an illegal memory access was encountered
```
Test Plan:
hg up e49463501
fbcode/ai_codesign/gen_ai/xlformers/scripts/run_xlformers_train_local.sh
Reviewed By: drisspg
Differential Revision: D47220255
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104628
Approved by: https://github.com/drisspg
New elements added to a tensor by `torch.Tensor.resize_` are set to NaN/MAX_INT when deterministic mode is turned on.
When `torch.Tensor.resize_` is called on a quantized tensor and deterministic mode is turned on, a nondeterministic error is raised.
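An illustration of the first behavior, assuming deterministic mode is enabled via `torch.use_deterministic_algorithms`:
```python
import torch

torch.use_deterministic_algorithms(True)
t = torch.empty(0)
t.resize_(4)   # the new elements are filled with NaN rather than left uninitialized
print(t)       # tensor([nan, nan, nan, nan])
```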
Part of #82004
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104300
Approved by: https://github.com/albanD
Prototype for the feature request:
>When working on a codebase that is unfamiliar to you, it can be helpful to single step through all of the code to see what is getting executed, what conditional branches are taken, and where indirect function jumps go. Model x-ray uses dynamo to give you a single step log of every source code line that does something relevant (i.e., a Tensor operation)
Dynamo logs to the ~~`starts_line`~~ `trace_source` logging artifact at the start of tracing new bytecode with a new line. It logs the line of source code associated with that bytecode.
~~Dynamo logs to the `graph_source` logging when a FX GraphModule is constructed. For each node in the graph, it logs the location of the original source code associated with that node.~~
Development notes: https://docs.google.com/document/d/1LjFeHzCgDDt535QUq5HydcQs56d7jWl5RvW8TLZN19g/edit?usp=sharing
Since the draft, we removed the `graph_source` logging artifact since printing the code of `GraphModule`s already displays the original source.
Sample:
```python
import torch
from functorch.experimental.control_flow import cond

def true_fn(x):
    return x * 2

def false_fn(x):
    return x * 3

def f_cond(pred, x):
    return cond(pred, true_fn, false_fn, [x])

def f_outer(pred, x):
    y = f_cond(pred, x)
    if x.sum() > 0:
        x = x * 2
    else:
        x = x * 3
    return x, y

opt_f_cond = torch.compile(f_outer, backend="eager")
opt_f_cond(torch.tensor(True), torch.randn(3, 3))
```
Logs:
```shell
$ TORCH_LOGS="trace_source" python playground8.py
TRACE starts_line f_outer playground8.py:54
def f_outer(pred, x):
TRACE starts_line f_outer playground8.py:55
y = f_cond(pred, x)
TRACE starts_line f_cond playground8.py:51 (inline depth: 1)
def f_cond(pred, x):
TRACE starts_line f_cond playground8.py:52 (inline depth: 1)
return cond(pred, true_fn, false_fn, [x])
TRACE starts_line true_fn playground8.py:45 (inline depth: 2)
def true_fn(x):
TRACE starts_line true_fn playground8.py:46 (inline depth: 2)
return x * 2
TRACE starts_line false_fn playground8.py:48 (inline depth: 2)
def false_fn(x):
TRACE starts_line false_fn playground8.py:49 (inline depth: 2)
return x * 3
TRACE starts_line f_outer playground8.py:56
if x.sum() > 0:
TRACE starts_line <resume in f_outer> playground8.py:56
if x.sum() > 0:
TRACE starts_line <resume in f_outer> playground8.py:57
x = x * 2
TRACE starts_line <resume in f_outer> playground8.py:60
return x, y
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104013
Approved by: https://github.com/ezyang
This commit improves the export of aten::slice() to ONNX in the following ways:
1. The step size can be an input tensor rather than a constant.
2. Fixes a bug where using a 1-D, 1-element torch tensor as an index created a broken ONNX model.
This commit also adds tests for the new functionality.
Fixes #104314
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104385
Approved by: https://github.com/thiagocrepaldi
We recently added an optimization to squash the x dimension for persistent reduction kernels when we are confident that XBLOCK will always be 1. We need to update the code so that the coordinate descent tuner does not tune XBLOCK in this case.
Test command; it fails before the fix and passes after:
```
TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 python benchmarks/dynamo/huggingface.py --backend inductor --amp --accuracy --only BertForMaskedLM --inference
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104692
Approved by: https://github.com/jansel
issues resolved: https://github.com/pytorch/pytorch/issues/104294
local test on TB and TIMM
* python benchmarks/dynamo/torchbench.py -d cuda --inference --accuracy --progress --export --print-dataframe-summary
* python benchmarks/dynamo/timm_models.py -d cuda --inference --accuracy --progress --export --print-dataframe-summary
why not HF
* huggingface uses kwargs (a dict) when calling torch.nn.Module
* we will need to support kwargs in torch._export.export, which is in progress
local test results
timm: 95% pass rate (58 out of 61 passed) P781702926
* 1 x [export specific] ERROR:common:Mutating module attribute rel_indices during export
* 1 x [not relevant to export] Unknown model (SelecSls42b)
* 1 x [not relevant to export] Failed to load model: HTTP Error 409: Public access is not permitted on this storage account
torchbench: 54% pass rate (41 out of 75 passed) P781690552
* 7 x ERROR:common:Dynamo input and output is a strict subset of traced input/output
* 3 x ERROR:common:call_method NNModuleVariable() / UserDefinedObjectVariable
* 3 x ERROR:common:Mutating module attribute {xx} during export.
* 2 x ERROR:common:inline in skipfiles
* 2 x ERROR:common:Consider annotating your code using constrain_as_*(). It appears that you're trying
* 1 x ERROR:common:guard on data-dependent symbolic int/float
* 1 x ERROR:common:Tensor.tolist
* 1 x ERROR:common:Tensor.numpy. Turn on config.numpy_ndarray_as_tensor and install torch_np to support tensor.numpy(). [may be dev env?]
* 1 x ERROR:common:missing: BUILD_SET
* 1 x ERROR:common:whole graph export entails exactly one guard export
* 1 x ERROR:common:call_function BuiltinVariable(str) [GetAttrVariable(UserMethodVariable(<function
* 1 x ERROR:common:Dynamic slicing on data-dependent value is not supported
* 1 x ERROR:common:Failed running call_function <function interpolate at 0x7f60a8361ea0>(*(FakeTensor(..., device='cuda:0', size=(1, 3, 427,
* 1 x ERROR:common:Dynamo attempts to add additional input during export: value=0.6177528500556946, source=RandomValueSource(random_call_index=0)
* 1 x Found following user inputs located at [16, 17, 18, 19, 20, 21, 22] are mutated. This is currently banned in the aot_export workflow.
* 1 x RuntimeError: cumsum_cuda_kernel does not have a deterministic implementation
* 4 x pass_due_to_skip
* 1 x eager_2nd_run_OOM
* 1 x fail_accuracy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104382
Approved by: https://github.com/zhxchen17
Previously:
- we were keeping a list of proxies seen by the current SubgraphTracer.
It turns out, fx.Proxy has a .tracer field that we should be able to use instead.
- we were using name matching to determine if a freevar was already
lifted to being the input of the parent SubgraphTracer. Voz and I have
previously expressed concerns about the robustness of name matching.
This PR introduces a simplified design with more invariants:
- When doing HigherOrderOp tracing, we may encounter Proxys
- Each Proxy object is associated with a SubgraphTracer.
- The new invariant is that SubgraphTracer should only construct Nodes
using Proxy that come from the SubgraphTracer. This helps us avoid
malformed graphs.
- If the Proxy object came from another SubgraphTracer, then this means
it is a free variable. We need to lift it to being an input of the
current SubgraphTracer, which will result in the construction of a new
Proxy in the current SubgraphTracer. This new Proxy should be used
whenever the old Proxy is seen by the current SubgraphTracer.
Test Plan:
- existing tests + some new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104350
Approved by: https://github.com/ydwu4, https://github.com/voznesenskym
**Summary**
Reduce the test time of `test_conv2d_binary_with_quantizer_api` and `test_conv2d_binary_unary_with_quantizer_api`.
* For `test_conv2d_binary_with_quantizer_api`, reduce the number of test config from 12 to 2.
* For `test_conv2d_binary_unary_with_quantizer_api`, reduce the number of test config from 24 to 2.
**Test Plan**
```
python -m pytest test_x86inductor_quantizer.py -k test_conv2d_binary_with_quantizer_api
python -m pytest test_x86inductor_quantizer.py -k test_conv2d_binary_unary_with_quantizer_api
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104686
Approved by: https://github.com/jerryzh168
This allows us to use use_dtensor=True for ShardedStateDictConfig() before calling model.load_state_dict(). It only works for offload_to_cpu=False for now.
Next PR will make use_dtensor=True work with offload_to_cpu=True for load_state_dict().
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104087
Approved by: https://github.com/fegin
Fixes#104484
For >= 3.10, we use `inspect.get_annotations` instead of `getattr(.., "__annotations__")`. [Docs](https://docs.python.org/3/library/inspect.html#inspect.get_annotations) say that get_annotations() "Ignores inherited annotations on classes. If a class doesn't have its own annotations dict, returns an empty dict.". In practice though, this doesn't always seem to be true; until you call inspect.getmembers, it seems like you still get inherited annotations. In particular, this means that if you script a certain type twice, the first time it may pass scripting but on the second try it may not pass scripting.
This PR adds a more comprehensive handling of get_annotations by recursively reading the annotations of the base types. (TorchScript doesn't officially support this; but since it worked in <3.10, it's now breaking internal stuff as python gets upgraded to 3.10)
Verified in #104486 that the test does actually fail before the changes in this PR were added.
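A sketch of the recursive gathering this implies (illustrative; the TorchScript change differs in detail):
```python
import inspect

def get_all_annotations(cls) -> dict:
    annotations = {}
    # Walk the MRO from base classes to the class itself so derived annotations win.
    for base in reversed(cls.__mro__):
        annotations.update(inspect.get_annotations(base))
    return annotations
```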
Differential Revision: [D47163891](https://our.internmc.facebook.com/intern/diff/D47163891)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104485
Approved by: https://github.com/eellison
Previously, you'd get `<eval_with_key>.0`; now you get `<eval_with_key>.0 from /data/users/ezyang/b/pytorch/test/dynamo/test_misc.py:5683 in forward`
I used to do this with globals, but now I do it with a `co_fields` parameter that's plumbed around, because putting things in globals has implications(TM). Happy to bikeshed on the `co_fields` structure.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103885
Approved by: https://github.com/albanD
Fixes #88286, Fixes #97160
Repro:
```python
import torch
import io
from torch.utils.checkpoint import checkpoint

class A(torch.nn.Module):
    # A supported module.
    def __init__(self):
        super(A, self).__init__()
        self.l1 = torch.nn.Linear(2, 2)

    def forward(self, x):
        return self.l1(x)

class B(torch.nn.Module):
    # This module is not exportable to ONNX because it
    # uses gradient-checkpointing. However, its two sub-modules
    # are exportable, so ORTModule should be used to compute them.
    def __init__(self):
        super(B, self).__init__()
        self.l1 = torch.nn.Linear(2, 2)
        self.a = A()

    def forward(self, x):
        def custom():
            def custom_forward(x_):
                return self.a(x_)
            return custom_forward
        z = self.l1(checkpoint(custom(), x))
        return z

torch.onnx.export(
    B(),
    (torch.randn(2, 2),),
    io.BytesIO(),
    autograd_inlining=True,
)
```
`torch.onnx.export(autograd_inlining=True)` should repro the user error as this is the original execution path.
```bash
Traceback (most recent call last):
File "repro88286.py", line 36, in <module>
torch.onnx.export(
File "<@beartype(torch.onnx.utils.export) at 0x7f0f011faee0>", line 385, in export
File "/opt/pytorch/torch/onnx/utils.py", line 511, in export
_export(
File "/opt/pytorch/torch/onnx/utils.py", line 1576, in _export
graph, params_dict, torch_out = _model_to_graph(
File "<@beartype(torch.onnx.utils._model_to_graph) at 0x7f0f01187dc0>", line 11, in _model_to_graph
File "/opt/pytorch/torch/onnx/utils.py", line 1130, in _model_to_graph
graph, params, torch_out, module = _create_jit_graph(model, args)
File "/opt/pytorch/torch/onnx/utils.py", line 1006, in _create_jit_graph
graph, torch_out = _trace_and_get_graph_from_model(model, args)
File "/opt/pytorch/torch/onnx/utils.py", line 910, in _trace_and_get_graph_from_model
trace_graph, torch_out, inputs_states = torch.jit._get_trace_graph(
File "/opt/pytorch/torch/jit/_trace.py", line 1269, in _get_trace_graph
outs = ONNXTracedModule(f, strict, _force_outplace, return_inputs, _return_inputs_states)(*args, **kwargs)
File "/opt/pytorch/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/pytorch/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/pytorch/torch/jit/_trace.py", line 128, in forward
graph, out = torch._C._create_graph_by_tracing(
File "/opt/pytorch/torch/jit/_trace.py", line 119, in wrapper
outs.append(self.inner(*trace_inputs))
File "/opt/pytorch/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/pytorch/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/pytorch/torch/nn/modules/module.py", line 1492, in _slow_forward
result = self.forward(*input, **kwargs)
File "repro88286.py", line 32, in forward
z = self.l1(checkpoint(custom(), x))
File "/opt/pytorch/torch/utils/checkpoint.py", line 412, in checkpoint
return CheckpointFunction.apply(function, preserve, *args)
File "/opt/pytorch/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
RuntimeError: _Map_base::at
```
By using `autograd_inlining=False`, the export still fails, with a different error, because autograd inlining is not enabled:
```bash
Traceback (most recent call last):
File "repro88286.py", line 36, in <module>
torch.onnx.export(
File "<@beartype(torch.onnx.utils.export) at 0x7f6088b32ee0>", line 385, in export
File "/opt/pytorch/torch/onnx/utils.py", line 511, in export
_export(
File "/opt/pytorch/torch/onnx/utils.py", line 1615, in _export
) = graph._export_onnx( # type: ignore[attr-defined]
RuntimeError: ONNX export failed: Couldn't export Python operator CheckpointFunction
```
To allow `CheckpointFunction` into the onnx graph, `operator_export_type=torch.onnx.OperatorExportTypes.ONNX_FALLTHROUGH` flag can be added to `torch.onnx.export`, which would lead to the following ONNX graph:
```bash
Exported graph: graph(%prim::PythonOp_0 : Float(2, 2, strides=[2, 1], requires_grad=0, device=cpu),
%l1.weight : Float(2, 2, strides=[2, 1], requires_grad=1, device=cpu),
%l1.bias : Float(2, strides=[1], requires_grad=1, device=cpu)):
%/PythonOp_output_0 : Float(2, 2, strides=[2, 1], requires_grad=0, device=cpu) = ^CheckpointFunction[inplace=0, module="torch.utils.checkpoint", onnx_name="/PythonOp"](<function B.forward.<locals>.custom.<locals>.custom_forward at 0x7fdf9182f670>, True)(%prim::PythonOp_0), scope: __main__.B:: # /opt/pytorch/torch/autograd/function.py:506:0
%6 : Float(2, 2, strides=[2, 1], requires_grad=1, device=cpu) = onnx::Gemm[alpha=1., beta=1., transB=1, onnx_name="/l1/Gemm"](%/PythonOp_output_0, %l1.weight, %l1.bias), scope: __main__.B::/torch.nn.modules.linear.Linear::l1 # /opt/pytorch/torch/nn/modules/linear.py:114:0
return (%6)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104067
Approved by: https://github.com/BowenBao, https://github.com/kit1980
**Summary**
We already support vectorized code gen for the `dequant-relu-quant` pattern, in which `to_uint8` is the last node of the quant pattern before the store into memory. However, there is another case: `dequant1-relu-quant2-dequant2-relu-quant3`. Here `quant2` is in the middle of the fusion pattern, and we enable vectorized code gen of `quant2-dequant2` in this PR.
**Test Plan**
```
python -u -m pytest -s -v test_cpu_repro.py -k test_dequant_relu_quant_dequant_relu_quant_lowering
```
**Next Step**
* For better performance, we can add another pass to eliminate pair nodes of `float_to_uint8` and `uint8_to_float`.
* For better performance, we should annotate `dequant1` and `quant2` as sharing an observer in the quantization recipe. Then we can lower `dequant1-relu-quant2` into a QReLU node to fully eliminate the calculation of `dequant1` and `quant2`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104503
Approved by: https://github.com/jgong5, https://github.com/jansel
Some notes:
* I now manually turn off `_generate` jobs from running with cudagraphs, as it is unrealistic to expect to cudagraph autoregressive generation up to max sequence length, since this would imply compiling the entire unrolled sequence generation. Concretely, cm3leon_generate was timing out post this change, likely due to the compile time slowdown of dynamic shapes ON TOP OF accidentally unrolling all the loops
* A few torch._dynamo.reset tactically inserted to force recompiles on tests that expected it
* expectedFailureAutomaticDynamic now flips into patching automatic_dynamic_shapes=False
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103623
Approved by: https://github.com/voznesenskym
This adds an expect-test that finds the set of core ATen operators by
subtracting the operators with decomposition in core_aten_decompositions from the
set of all operators that have decompositions and could be decomposed.
This is useful because if you add a new decomposition but forget to add it to
the list of core decompositions, it will appear in the PR diff.
Also, by going through this list I have identified some operators where the
functional variant is decomposed but the in-place variant is not, which must be an
oversight.
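Roughly, the set arithmetic looks like the following sketch against the tables in `torch._decomp` (not the actual expect-test):
```python
import torch._decomp as decomp

all_decomposed = set(decomp.decomposition_table)
core_decomposed = set(decomp.core_aten_decompositions())

# Ops that have a decomposition available but are not decomposed by the core
# set: these are the candidate "core ATen" operators surfaced by the expect-test.
candidates = sorted(str(op) for op in all_decomposed - core_decomposed)
print(len(candidates))
```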
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104262
Approved by: https://github.com/lezcano
**TL;DR**: This PR is a first step in adding lowerings for torch.bucketize. It adds an initial lowering for this op - but because this implementation is not currently efficient, it registers the lowering for prims._inductor_bucketize. After we make the implementation more efficient, we'll remove prims._inductor_bucketize and add the lowering directly to torch.bucketize.
**Background - torch.bucketize**: torch.bucketize(values, boundaries, right=False): for an arbitrary tensor of values and a non-decreasing 1D tensor of boundaries that define buckets, it returns the index of the bucket that each of the values will fall in. e.g. for values [0, 1, 2, 3, 4] and boundaries [1, 3], it will return [0, 0, 1, 1, 2].
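For reference, a quick eager-mode illustration of the behavior described above and of the `right` kwarg discussed below:
```python
import torch

values = torch.tensor([0, 1, 2, 3, 4])
boundaries = torch.tensor([1, 3])

# right=False (default): a value equal to a boundary stays in the bucket on the left.
print(torch.bucketize(values, boundaries))              # tensor([0, 0, 1, 1, 2])
# right=True: a value equal to a boundary moves to the bucket on the right.
print(torch.bucketize(values, boundaries, right=True))  # tensor([0, 1, 1, 2, 2])
```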
**Implementation**: This PR adds a new inductor op called "bucketize". In this PR it only has a triton implementation - for CPU it is a fallback. The triton implementation uses a binary search in `triton_helpers.py`. This PR also adds a new prim `_inductor_bucketize()` for testing purposes and adds lowering for this op.
~~**"right"**: The current behavior of the "right" kwarg in the inductor op is the opposite of the behavior of the torch op. "right" controls how the op treats a value that is equal to one of the boundary values. In the torch op, "right=True" means "if a value is equal to a boundary value, then put it in the bucket to the right". In the inductor op, "right=True" means "the right boundary of a bucket is closed". These are opposite. **I'm open to switching the behavior of the inductor op** - but I chose to implement this way because I think it makes more sense, and I think the torch.bucketize behavior may have been a mistake (it's the opposite of numpy.digitize).~~ Switched the behavior of the inductor bucketize op to match the torch op
* places where "right" means "if a value is equal to a boundary value, then put it in the bucket to the right" (i.e. current torch.bucketize behavior)
+ current torch.bucketize behavior
+ table in [torch.bucketize docs](https://pytorch.org/docs/stable/generated/torch.bucketize.html)
* places where "right" means "the right boundary of a bucket is closed":
+ the text description of [torch.bucketize docs](https://pytorch.org/docs/stable/generated/torch.bucketize.html) (observed in #91580)
+ [numpy.digitize](https://numpy.org/doc/stable/reference/generated/numpy.digitize.html) (which is basically the same op)
**Performance**: Benchmark script: "values" as a [16, 1024, 1024] float32 tensor and "boundaries" as a [1025] tensor (i.e. defining 1024 buckets).
As is:
```
Eager 0.30117499828338623 ms
PT2 0.9298200011253357 ms
```
But performance improves significantly if we add an additional pointwise autotuning config (WIP in #104456):
```
Eager 0.3015420138835907 ms
PT2 0.23028500378131866 ms
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104007
Approved by: https://github.com/jansel
Summary:
The current implementation of `Dispatcher` returns an RAII object
from its `register*` methods which, on destruction, uses a saved
reference to the `Dispatcher` to call the associated `deregister*`
method.
However, nothing guarantees that the `Dispatcher` is destroyed
*after* all RAII objects have been destroyed and, in practice, we
see segfaults caused when a global `Dispatcher` is cleaned up
before RAII globals.
This diff fixes this by keeping the `Dispatcher` lock and "alive" marker
in a `std::shared_ptr` which the callbacks copy and then use to
verify that the `Dispatcher` is still alive before continuing.
https://fb.workplace.com/groups/1405155842844877/posts/7143161099044294/
https://fb.workplace.com/groups/python.builds/posts/3510588832595867/
S349108
Test Plan: CI
Differential Revision: D47113122
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104393
Approved by: https://github.com/ezyang
**Summary**
Refactor the vectorization code generation for the uint8 input data type. Previously, we combined the uint8 data load and the uint8-to-float conversion into one step (`load_uint8_as_float` and `store_float_as_uint8`). After the refactor, we split them into two steps, load/store and data type conversion, to match the behavior of the BFloat16 data type.
The previous generated code is:
```
#pragma omp for
for(long i0=static_cast<long>(0L); i0<static_cast<long>(432L); i0+=static_cast<long>(16L))
{
auto tmp0 = at::vec::load_uint8_as_float(in_ptr0 + static_cast<long>(i0));
auto tmp1 = (tmp0);
auto tmp2 = at::vec::Vectorized<float>(static_cast<float>(100.0));
auto tmp3 = tmp1 - tmp2;
auto tmp4 = at::vec::Vectorized<float>(static_cast<float>(0.01));
auto tmp5 = tmp3 * tmp4;
auto tmp6 = at::vec::clamp_min(tmp5, decltype(tmp5)(0));
auto tmp7 = tmp6 * tmp2;
auto tmp8 = tmp7.round();
auto tmp9 = tmp8 + tmp2;
auto tmp10 = at::vec::Vectorized<float>(static_cast<float>(0.0));
auto tmp11 = at::vec::maximum(tmp9, tmp10);
auto tmp12 = at::vec::Vectorized<float>(static_cast<float>(255.0));
auto tmp13 = at::vec::minimum(tmp11, tmp12);
auto tmp14 = (tmp13);
at::vec::store_float_as_uint8(tmp14, out_ptr0 + static_cast<long>(i0));
}
```
After this PR, the generated code is:
```
#pragma omp for
for(long i0=static_cast<long>(0L); i0<static_cast<long>(432L); i0+=static_cast<long>(16L))
{
auto tmp0 = at::vec::Vectorized<uint8_t>::loadu(in_ptr0 + static_cast<long>(i0), 16);
auto tmp1 = cvt_uint8_to_fp32_with_same_elem_num(tmp0);
auto tmp2 = at::vec::Vectorized<float>(static_cast<float>(100.0));
auto tmp3 = tmp1 - tmp2;
auto tmp4 = at::vec::Vectorized<float>(static_cast<float>(0.01));
auto tmp5 = tmp3 * tmp4;
auto tmp6 = at::vec::clamp_min(tmp5, decltype(tmp5)(0));
auto tmp7 = tmp6 * tmp2;
auto tmp8 = tmp7.round();
auto tmp9 = tmp8 + tmp2;
auto tmp10 = at::vec::Vectorized<float>(static_cast<float>(0.0));
auto tmp11 = at::vec::maximum(tmp9, tmp10);
auto tmp12 = at::vec::Vectorized<float>(static_cast<float>(255.0));
auto tmp13 = at::vec::minimum(tmp11, tmp12);
auto tmp14 = cvt_fp32_to_uint8(tmp13);
tmp14.store(out_ptr0 + static_cast<long>(i0), 16);
}
```
**Test Plan**
```
python -m pytest test_cpu_repro.py -k test_decomposed_dequant_relu_quant
python -m pytest test_cpu_repro.py -k test_tile2d_load_decomposed_dequant_add_relu_quant
python -m pytest test_cpu_repro.py -k test_tile2d_store_channel_shuffle_cl_quant_output
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104075
Approved by: https://github.com/jgong5, https://github.com/jansel
When tracing with symbolic shapes, arbitrary sym_size nodes can appear in the
graph. Earlier changes did not account for this, and the quantizer fails to annotate
the right nodes. This diff fixes that by not annotating sym_size nodes, which
should really not be relevant for quantization.
As next steps, we should validate in quant workflow that a) sym_int nodes are not
being quantized and b) add similar support, as this diff, for generic
annotations
Differential Revision: [D47132050](https://our.internmc.facebook.com/intern/diff/D47132050/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104473
Approved by: https://github.com/jerryzh168
This PR fixes the circular-include issue during the hipification process by introducing a current_state to track whether a file has been processed for hipification (iterative DFS).
The issue arises when two header files include each other, which leads to circular recursion or an infinite loop.
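A simplified sketch of the idea: track a per-file state so a file that is already being (or has been) hipified is never re-entered, which breaks the cycle when headers include each other. `get_includes` and `process_file` are hypothetical callbacks standing in for the real machinery.
```python
def hipify_all(entry_file, get_includes, process_file):
    state = {}  # file -> "processing" or "done"
    stack = [entry_file]
    while stack:
        f = stack.pop()
        if f in state:
            continue  # already handled: skip instead of recursing forever
        state[f] = "processing"
        process_file(f)
        for inc in get_includes(f):
            if inc not in state:
                stack.append(inc)
        state[f] = "done"
```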
Fixes related issues such as:
https://github.com/pytorch/pytorch/issues/93827
https://github.com/ROCmSoftwarePlatform/hipify_torch/issues/39
Error log:
```
File "/opt/conda/lib/python3.8/posixpath.py", line 471, in relpath
start_list = [x for x in abspath(start).split(sep) if x]
File "/opt/conda/lib/python3.8/posixpath.py", line 375, in abspath
if not isabs(path):
File "/opt/conda/lib/python3.8/posixpath.py", line 63, in isabs
sep = _get_sep(s)
File "/opt/conda/lib/python3.8/posixpath.py", line 42, in _get_sep
if isinstance(path, bytes):
RecursionError: maximum recursion depth exceeded while calling a Python object
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104085
Approved by: https://github.com/jithunnair-amd, https://github.com/malfet
This fixes a bug in the profiler code, exposed by https://github.com/pytorch/pytorch/pull/104368, which relied on the fact that `import torch._dynamo` also imports `torch._inductor.config`:
```
$ python -c "import torch._inductor;print(torch._inductor.config)"
Traceback (most recent call last):
File "<string>", line 1, in <module>
AttributeError: module 'torch._inductor' has no attribute 'config'
(base) $ python -c "import torch._dynamo;print(torch._inductor.config)"
<module 'torch._inductor.config' from '/home/nshulga/git/pytorch/pytorch/torch/_inductor/config.py'>
```
### Testing
D47159397
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104477
Approved by: https://github.com/aaronenyeshi, https://github.com/malfet
This addresses https://github.com/pytorch/pytorch/issues/104187.
After this PR, the contract with the user is that:
- If passing `param_init_fn=None`, each `nn.Module.reset_parameters()` should only initialize its own parameters/buffers (like `parameters(recurse=False)`/`buffers(recurse=False)`).
- If passing `param_init_fn` not equal to `None`, then similarly, one call to `param_init_fn(module)` should only initialize `module`'s own parameters/buffers.
With this contract and this PR's changes, meta device initialization through either `reset_parameters()` or `param_init_fn` should be correct. Those functions will run on the original parameter/buffer shapes allowing for correct shape-dependent computations like for fan-in/fan-out, and there will not be any re-initialization of any module.
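A hedged sketch of what such a `param_init_fn` can look like under this contract (the initializer choices are arbitrary, and meta-device materialization is omitted):
```python
import torch.nn as nn

def my_param_init_fn(module: nn.Module) -> None:
    # Only touch this module's *own* parameters (recurse=False), so that each
    # module in the tree is initialized exactly once when FSDP walks it.
    for param in module.parameters(recurse=False):
        if param.dim() > 1:
            nn.init.kaiming_uniform_(param)
        else:
            nn.init.zeros_(param)
```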
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104189
Approved by: https://github.com/rohan-varma
Just a nit fix where `GammaBetaBackwardCUDAKernel_32x32` kernel used a hardcoded warp size for performing the reduction and laneId calculation. Changed this to use `C10_WARP_SIZE`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104441
Approved by: https://github.com/malfet
Ignore the config fusion limit for foreach nodes since they have their own fusion limit and will be split automatically. With the config limit applied, epilogue copies would stop being fused whenever there are more than 64 tensors in the foreach lists (very bad), which would create a ton of extra allocations. With this change, fusions with the subkernels still respect the config limit.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104471
Approved by: https://github.com/jansel
The current test case produces an edge-case tensor input that causes a single generated tensor to fail the tolerance assertion, on ROCm only and only for float32. We have reviewed the logic with our libraries team and have discovered the discrepancy is due to a difference in the order of operations on AMD GPUs. They came back with "working as intended" and found no perceivable bug. Interestingly, if we change the values in ks, ns, or bs, the test passes on ROCm. These particular sizes in this particular order generate a single problematic input that causes the assertion to fail the tolerance check by ~0.07. Again, this is not a bug, just a difference in implementation. This PR loosens the tolerance for ROCm only.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104425
Approved by: https://github.com/jeffdaily, https://github.com/nikitaved, https://github.com/lezcano
Fixes #ISSUE_NUMBER
Currently, for custom devices, we use `getattr` and `setattr` to run the funcs defined in the custom device module in several files, such as `AMP`, `random`, `DDP` and so on. So I want to add a generic func to fetch these funcs in a friendlier way, could you take a look? @bdhirsh @albanD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99048
Approved by: https://github.com/bdhirsh
Summary:
Implemented `aten::masked_fill` for Vulkan backend, see https://pytorch.org/docs/stable/generated/torch.Tensor.masked_fill.html for the behavior of this operator.
Some explanation of the implementation:
- The shapes of the input tensor and mask should be broadcastable (see [broadcasting semantics](https://pytorch.org/docs/stable/notes/broadcasting.html)). For example, the input tensor is of shape [3, 1, 5] and mask of shape [2, 1]. Then the output is of shape [3, 2, 5].
- A straightforward implementation is to generate an output and a mask, both of shape [3, 2, 5], by applying `repeat` operations on the input tensor and mask respectively. Then we traverse the mask and fill elements of output with `value` where mask is `True`.
- However the `repeat` operation on mask is unnecessary and incurs extra time and space overhead. Instead we can keep the mask as it is and traverse the original mask and compute the corresponding broadcasted positions in the output tensor (see the shader file `masked_fill.glsl` for such computation).
Some explanation of the test:
- We test all possible broadcasting of the input tensor and mask. Manually setting all possible broadcastable shapes is intimidating. Instead we apply the following algorithm to automatically generate all possible cases, which only requires a single starting shape for the input tensor and mask (a Python sketch of this generation follows after this list).
- First we set an identical shape for the `input_shape` and `mask_shape`, e.g. both are of [3, 5, 2, 3].
- Then we truncate all possible proceeding dimensions of `input_shape` and `mask_shape` respectively. Denote the results as `curr_input_shape` and `curr_mask_shape`, e.g. `curr_input_shape = [5, 2, 3]` and `curr_mask_shape = [2, 3]`.
- Next, for both `curr_input_shape` and `curr_mask_shape` we generate all possible subsets of the indices and set the corresponding elements to 1 for each subset. For example, for `curr_input_shape = [5, 2, 3]`, a possible `input_idx_subset = [0, 2]`. We set the 0th and 2nd elements of `curr_input_shape` to be 1, then `curr_input_shape = [1, 2, 1]`. Similarly for `curr_mask_shape = [2, 3]`, a possible `mask_idx_subset = [0]`, then the updated `curr_mask_shape = [1, 3]`.
- In the end, we test `masked_fill` with the combinations of `curr_input_shape` and `curr_mask_shape`. In the example above, an output tensor of shape [1, 2, 3] will be generated.
- In `vulkan_api_test.cpp`, a function `gen_all_subsets` is implemented to generate all possible subsets of a given set of indices through backtracking.
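A small self-contained sketch of that shape generation (the C++ test implements it with the backtracking `gen_all_subsets` helper):
```python
from itertools import combinations

def broadcastable_variants(shape):
    # For every truncation of leading dims, set every subset of the remaining
    # dims to 1, yielding the shapes used to pair input and mask in the test.
    for start in range(len(shape) + 1):
        truncated = list(shape[start:])
        for r in range(len(truncated) + 1):
            for idx_subset in combinations(range(len(truncated)), r):
                variant = truncated[:]
                for i in idx_subset:
                    variant[i] = 1
                yield tuple(variant)

variants = sorted(set(broadcastable_variants([3, 5, 2, 3])))
print(len(variants))
```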
Test Plan:
Full test result is shown in P777851326. `masked_fill` related tests are shown below.
```
(base) luwei@luwei-mbp fbsource % buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 -- --gtest_filter="*mask*"
Building: finished in 0.1 sec (100%) 264/2820 jobs, 0/2820 updated
Total time: 0.1 sec
BUILD SUCCEEDED
Running main() from xplat/third-party/gmock/googletest-1.12.1/googletest/src/gtest_main.cc
Note: Google Test filter = *mask*
[==========] Running 5 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 5 tests from VulkanAPITest
[ RUN ] VulkanAPITest.masked_fill_invalidinputs_exceptions
[ OK ] VulkanAPITest.masked_fill_invalidinputs_exceptions (35 ms)
[ RUN ] VulkanAPITest.masked_fill_scalar_mult4ch
[ OK ] VulkanAPITest.masked_fill_scalar_mult4ch (582 ms)
[ RUN ] VulkanAPITest.masked_fill_scalar_nonmult4ch
[ OK ] VulkanAPITest.masked_fill_scalar_nonmult4ch (592 ms)
[ RUN ] VulkanAPITest.masked_fill_tensor_mult4ch
[ OK ] VulkanAPITest.masked_fill_tensor_mult4ch (0 ms)
[ RUN ] VulkanAPITest.masked_fill_tensor_nonmult4ch
[ OK ] VulkanAPITest.masked_fill_tensor_nonmult4ch (0 ms)
[----------] 5 tests from VulkanAPITest (1212 ms total)
[----------] Global test environment tear-down
[==========] 5 tests from 1 test suite ran. (1212 ms total)
[ PASSED ] 5 tests.
```
Reviewed By: SS-JIA
Differential Revision: D46423811
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104444
Approved by: https://github.com/SS-JIA
This is yet another step to move windows instances away from ephemeral instances, more details on #101209
Queue times have been very high recently for this instance type, so migrating away from ephemeral instances will provide a big relief for developers. Even if flakiness is introduced, the overall time-to-signal will be smaller given the 20h queue-time peaks we've been experiencing.

# Copilot Summary
### <samp>🤖 Generated by Copilot at cde9c95</samp>
This pull request updates several GitHub Actions workflow files and a template file to use non-ephemeral CUDA GPU runners for Windows binary build jobs. This improves the performance and stability of these jobs and makes the job names more consistent.
# Copilot Poem
### <samp>🤖 Generated by Copilot at cde9c95</samp>
> _`runs-on` changes_
> _CUDA jobs need `nonephemeral`_
> _faster winter builds_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104404
Approved by: https://github.com/malfet
Summary:
This diff adds a path in inductor to invoke gcc through Remote Execution, when run from within fbcode.
This should (hopefully) let us kill the `inductor.disable_cpp_codegen` flag, since we should now be able to invoke clang at runtime from within fbcode to compile c++ code. This was preventing https://github.com/pytorch/pytorch/pull/100115 from landing, which fixed one of the last remaining models in torchbench that was failing with `torch.compile` (hf_Longformer).
Enumeration of changes:
- updated inductor to invoke `_run_build_command()` when in fbcode, which hooks into Remote Execution
- When inductor invokes g++ normally, it includes a bunch of absolute paths, to stuff like the pytorch header paths, and the input and output path. I changed these all to relative paths when in fbcode, and copied everything we needed into a temp dir that we send to Remote Execution.
- updated `triton/fb/make_build_paths.py` to let us grab paths to openmp, sleef, and ld from within the Remote Execution environment. I'm not sure if there's a better way to do this (but this way appeared to work, thanks to Bert's suggestion from https://www.internalfb.com/diff/D46482550?dst_version_fbid=231706286239076&transaction_fbid=229345569847706)
- refactored `triton/fb/build.py` (it had a function to create a triton build command and run it all in one go; I separated out the bit that takes in an arbitrary command (our clang command) and runs it with RE)
- a few tweaks to the include paths that inductor uses: it adds those two extra paths (sleef and openmp), and it also does not manually include the `-ltorch`,`-lc10`,`-ltorch_python`,`-ltorch_cpu` libs - the linker was complaining that it couldn't find those libs, and not including those flags ends up working
- I added a few more missing headers. Maybe with D46527002 this won't be necessary?
- I had a basic manual test in `scripts/hirsheybar/tmp2.py`. We probably want to try running an actual job in MAST to make sure this works.
Test Plan: `scripts/hirsheybar/pt2/tmp2.py` has a basic test, but I'm also planning on testing by kicking off a MAST job with cmf_10x (thanks to a bunch of help from Bert)
Reviewed By: bertmaher
Differential Revision: D46364355
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104351
Approved by: https://github.com/bertmaher
Summary: Handling the `out-of-line definition of constexpr static data member is redundant in C++17 and is deprecated [-Werror,-Wdeprecated]` warning on Xcode 15.
Test Plan: Build
Reviewed By: n0shake
Differential Revision: D46875270
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104049
Approved by: https://github.com/malfet
torch.bucketize takes a tensor of values, and a "boundaries" tensor, which is a sorted list of values that represent buckets. It returns the bucket that each value lies in. E.g. if values = [1, 5, 3, 6] and boundaries=[0, 2, 4, 6, 8], the output will be [1, 3, 2, 4].
The current decomposition of this op doesn't work well with dynamic shapes. It performs a binary search, which bakes in the number of iterations in the binary search and requires recompiling (I don't completely understand why/where this happens). I'm not sure whether there's a good way to write a decomposition for this op that will work with dynamic shapes.
Use case: this op is very similar to some operations needed by jagged tensors. As a first step, I want to add a lowering for aten.bucketize and make use of opinfos. #104007
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104396
Approved by: https://github.com/Chillee
Summary:
The test was failing in `lift_tracked_freevar_to_input `
https://www.internalfb.com/phabricator/paste/view/P776002064
Cause:
* line 1219 assumes that `lift_tracked_freevar_to_input` is never called by the root tracer
* However, when we see a bound free variable in a child tracer, line 1226 will invoke the parent tracer recursively.
* When it reaches the root tracer, the assumption will fail.
Fix:
* we relax the assumption: if `lift_tracked_freevar_to_input` is called on the root tracer, we validate the variable is bound free, to allow the case where `lift_tracked_freevar_to_input` is populated from child tracers.
Test Plan:
pytest ./generated/test_VainF_pytorch_msssim.py
pytest caffe2/test/dynamo/test_autograd_function.py -k test_function_with_bound_free_variable
Reviewed By: yanboliang
Differential Revision: D47033011
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104378
Approved by: https://github.com/Skylion007, https://github.com/yanboliang
When using torch.profiler.profile(record_shapes=True), the profiler tries to collect `tensor.sizes()` to put this information into the profile trace.
When dynamic shapes is turned on, sometimes tensors will appear that have symbolic sizes. In that case, `tensor.sizes()` can throw an assertion. This PR checks to see if tensor has symbolic shapes, and doesn't collect shape info in that case.
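A Python-level illustration of the guard described above (the actual change lives in the profiler's C++ collection code, so the helper name here is only illustrative):
```python
import torch

def collect_sizes(t: torch.Tensor):
    # Skip shape collection when any dimension is symbolic (dynamic shapes).
    if any(isinstance(s, torch.SymInt) for s in t.shape):
        return None
    return list(t.shape)
```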
Differential Revision: [D47082414](https://our.internmc.facebook.com/intern/diff/D47082414)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104320
Approved by: https://github.com/aaronenyeshi
Summary: Similar to quantized add, in this PR we added the reference represenation for quantize/dequantize operators
Test Plan:
buck2 test caffe2/test:quantization_pt2e -- --exact 'caffe2/test:quantization_pt2e - test_representation_quantize (quantization.pt2e.test_quantize_pt2e.TestQuantizePT2E)'
buck2 test caffe2/test:quantization_pt2e -- --exact 'caffe2/test:quantization_pt2e - test_representation_dequantize (quantization.pt2e.test_quantize_pt2e.TestQuantizePT2E)'
Reviewed By: kimishpatel
Differential Revision: D46959928
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104395
Approved by: https://github.com/andrewor14
This PR introduces value range refinement of shape symbols by symbolically evaluating the
value range of the involved guards. This should help `_maybe_evaluate_static` to eliminate
more guards.
This is a stack of PRs created from the discussion on: #96616.
In summary, this PR:
- simplifies `FloorDiv` nodes on the left-hand side of an expression so as to isolate a
symbol in the numerator
- tries to match the expression against the form: `<symbol> <relop> <expr>`
- uses the matched expression for refining the value range of `<symbol>` using the range
of `<expr>`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97963
Approved by: https://github.com/ezyang
This PR turns translation validation on by default for tests and accuracy benchmark
runs. It also installs Z3 on CI.
The main changes are:
- Add `--no-translation-validation` as an option in _test/run_tests.py_
- Set `PYTORCH_TEST_WITH_TV` environment variable
- Add `TEST_WITH_TV` variable in _torch/testing/_internal/common_utils.py_
- Turn translation validation on for accuracy benchmarks in _benchmarks/dynamo/common.py_
- Add Z3 installation on CI scripts
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103611
Approved by: https://github.com/ezyang
This branch is not an optimization, it's a correctness issue so there should be
a guard installed on both sides of the branch. Otherwise we could have an
expression like `(s0 - s1)` that is initially positive, then becomes negative
with a new set of shapes and now references an invalid index.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103780
Approved by: https://github.com/ezyang
#104256 erroneously removed the pybind definition for `reduce_scatter_tensor_coalesced` introduced in #103561
This adds it back in and introduces a test for the API.
Test command:
```
pytest test/distributed/test_c10d_nccl.py -vsk test_reduce_scatter_tensor_coalesced
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104345
Approved by: https://github.com/kwen2501
Summary:
This diff is reverting D46920584
D46920584: Make `torch.empty*` deterministic by filling with NaN or max int value (#101849) by generatedunixname499836121 has been identified to be causing the following test or build failures:
Tests affected:
- [torchrec/distributed/composable/tests:test_fsdp - torchrec.distributed.composable.tests.test_fsdp.FullyShardTest: test_composable_checkpoint](https://www.internalfb.com/intern/test/281475062923125/)
Here's the Multisect link:
https://www.internalfb.com/multisect/2341386
Here are the tasks that are relevant to this breakage:
We're generating a revert to back out the changes in this diff, please note the backout may land if someone accepts it.
If you believe this diff has been generated in error you may Commandeer and Abandon it.
Test Plan: NA
Reviewed By: huydhn, osalpekar
Differential Revision: D46997394
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104302
Approved by: https://github.com/osalpekar
Adding Workflows for building aarch64 Linux PyTorch PIP wheels
Updates:
* Created aarch64 template for generated workflows
* Updated generate_ci_workflows.py to include aarch64
* Generated the aarch64 wheel workflow
* added _binary-build-aarch64.yml for building aarch64 wheel
* added _binary-test-aarch64.yml for sanity check of aarch64 wheel
* Updated binary_linux_test.sh to use --extra-index-url for aarch64 until the needed aarch64 dependencies are available at https://download.pytorch.org/whl/nightly/cpu
NOTES:
* The build and test workflows are using arm64v8/alpine and quay.io/pypa/manylinux2014_aarch64:latest docker images at this time.
* Conda generated workflow not included at this time and being worked on.
Workflows were successfully tested at https://github.com/xncqr/pytorch/actions/runs/5351891068
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104109
Approved by: https://github.com/malfet, https://github.com/atalman
This commit speeds up the ONNX export of large models by making the following changes:
- Remove unnecessary memcpy in `GetGraphProtoSize`
- In `export.cpp`, pass around a pointer to the ModelProto instead of the ModelProto itself.
The shape inference function is still the slowest part of the export for these models with large weights, taking up 50% or more of the export time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103278
Approved by: https://github.com/BowenBao, https://github.com/thiagocrepaldi
Pin the pytest dependencies listed in requirements-ci.txt, also change the mac ones to match the linux ones.
The new pytest 7.4.0 causes some weird issues with printing skip messages, so pin to a previous version until I can figure out a fix
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104281
Approved by: https://github.com/huydhn
This PR adds support for `enable_grad`/`no_grad`/`autocast` context managers getting properly traced in `pre_dispatch` tracing. The stuff in this PR includes:
- I added a torch function mode that runs during make_fx pre_dispatch tracing, `ProxyTorchFunctionMode`. It directly intercepts the torch ops that run during the above context managers, and adds them to the current graph instead of executing them
- `enable_grad` and `no_grad` currently desugar into `torch._C.set_grad_enabled(bool)`, but this API isn't currently overrideable by torch function so I added the ability to interpose there
- the `torch.amp` context managers don't currently have a nice equivalent, like `set_autocast_enabled(state)`, so I ended up adding two new API's: `torch.amp._set_autocast_enabled` and `torch.amp._set_autocast_disabled`. If you look at how the context manager is implemented, it ends up calling several different state-changing functions, some of which depend on the backend - so I figured that it would be cleaner just to add a new API (that should probably only be used by tracing) - but open to feedback
- I added a new dynamo backend, `compile(backend="pre_dispatch_eager")`. When pre_dispatch tracing becomes always-on in inductor, it will be another potential surface for bugs. I also added a test file for it (`test/dynamo/test_pre_dispatch.py`).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103024
Approved by: https://github.com/ezyang
- In #102759, the support for `quantized::conv_transposeNd` was introduced. This incorrectly set `output_padding` to all zeros. Turns out, you can specify output_padding in PyTorch, but this parameter was not being unpacked correctly and thus did not show up in the python torch->onnx code.
- This adds unpacking of output_padding in `unpack_quantized_weights.cpp` when needed. It also adds this as a parameter in the python functions and uses that (and removes the all-zero defaults)
- Another issue with #102759 is that it only added these new ops to opset10 without adding the ability to specify axis in opset13. This PR also fixes this.
Fixes #104206
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104207
Approved by: https://github.com/BowenBao
Fix cpp wrapper CPU performance gap on `swsl_resnext101_32x16d` compared with the default python wrapper.
The pre-trained weights of `swsl_resnext101_32x16d` contain denormal numbers (close to 0.0).
Linking with `-ffast-math` will make the CPU flush denormals.
For the default python wrapper, the compilation and linking are done in one command thus `-ffast-math` will take effect in both compilation and linking.
CPP wrapper leverages cpp_extension which will do the compilation and linking in two stages, thus we need to explicitly add `-ffast-math` as a linking flag.
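As a rough illustration of that two-stage point, with `torch.utils.cpp_extension` the compile and link flags are passed separately, so a flag like `-ffast-math` has to be supplied for both stages (the extension name and source file below are hypothetical):
```python
from torch.utils.cpp_extension import load

ext = load(
    name="my_ext",                   # hypothetical extension name
    sources=["my_ext.cpp"],          # hypothetical source file
    extra_cflags=["-ffast-math"],    # affects compilation only
    extra_ldflags=["-ffast-math"],   # also needed at link time so denormals get flushed
)
```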
Single thread single batch on ICX:
| Model | time (s) default python wrapper | time (s) cpp wrapper before fix | time (s) cpp wrapper after fix |
| -- | -- | -- | -- |
| swsl_resnext101_32x16d | 0.459097836 | 13.82326214 | 0.448116195 |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104332
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/EikanWang
Use [PEP-562](https://peps.python.org/pep-0562) to import `_dynamo` and `_inductor` only when needed.
- Remove redundant imports from tests
- Add `test_lazy_imports_are_lazy` to make sure they will not get imported by accident
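For reference, a minimal sketch of the PEP-562 mechanism (a module-level `__getattr__`); this is not the exact code in `torch/__init__.py`:
```python
import importlib

_LAZY_SUBMODULES = {"_dynamo", "_inductor"}

def __getattr__(name):
    # Only import the heavy submodule on first attribute access.
    if name in _LAZY_SUBMODULES:
        return importlib.import_module(f"{__name__}.{name}")
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
```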
### <samp>🤖 Generated by Copilot at bae8e90</samp>
> _Sing, O Muse, of the daring deeds of PyTorch, the swift and fiery_
> _framework of deep learning, that with skill and cunning wrought_
> _many wonders of dynamic compilation, using the hidden powers_
> _of `_dynamo` and `_inductor`, the secret modules of LLVM and MLIR._
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104368
Approved by: https://github.com/msaroufim, https://github.com/albanD
This PR introduces a translation validator for dynamo guards. In summary, it verifies
whether the guards issued as Python code are sound, w.r.t the initial guards.
The main changes in this PR are:
- Create an FX graph for dynamic shapes
- Translate "the original" guards from the FX graph to Z3
- Check if the guards produced by `produce_guards` are sound w.r.t. the ones from the FX graph
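A toy sketch of the soundness check using the `z3` Python bindings (the guard expressions here are made up): the produced guards are sound if they are implied by the original guards, i.e. "original AND NOT produced" is unsatisfiable.
```python
import z3

s0, s1 = z3.Ints("s0 s1")
original_guards = z3.And(s0 > 1, s1 > 1, s0 == s1)  # hypothetical guards from the FX graph
produced_guards = z3.And(s0 > 0, s0 == s1)          # hypothetical guards from produce_guards

solver = z3.Solver()
solver.add(z3.Not(z3.Implies(original_guards, produced_guards)))
print("sound" if solver.check() == z3.unsat else "unsound")
```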
gh-stack version of the PR #101146.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102563
Approved by: https://github.com/ezyang
This PR:
* Address comment at https://github.com/pytorch/pytorch/pull/103887/files#r1244128266.
* Add test for graph partition to make sure assertion ops functionalization won't break graph partition in unexpected way.
**NOTE**:
In the context of export, it's totally up to the user to do any type of graph partition based on the specific use case. It's hard to anticipate the concrete downstream use case or provide any specific functionality to facilitate handling assertion ops (functional / non-functional). So this PR limits itself to [`CapabilityBasedPartitioner`](2da6cae43c/torch/fx/passes/infra/partitioner.py (L34)) and makes sure it doesn't break graph partition unexpectedly (by adding some tests).
For the test case used in PR, a few things to highlight:
* Without assertion, the fused graph is roughly like:
```
class fused(torch.nn.Module):
def forward(self, a, b):
fused_1 = self.fused_1(a, b);
relu = fused_1.relu()
fused_0 = self.fused_0(fused_1, relu)
return (fused_0, fused_1)
class fused_0(torch.nn.Module):
def forward(self, add_2, relu):
... # Logic after relu
return add_4
class fused_1(torch.nn.Module):
def forward(self, a, b):
... # Logic before relu, `add_1` is only exposed within this submodule.
return add_2
```
* With the assertion, the fused graph is roughly like:
```
class fused(torch.nn.Module):
def forward(self, arg0_1: i64[s0], arg1_1: i64[s0]):
dep_token0 = ...
...
fused_1 = self.fused_1(arg0_1, arg1_1); arg0_1 = arg1_1 = None
...
getitem: i64[s0] = fused_1[0] # `getitem` is actually `add_1`
...
relu_default: i64[s0] = torch.ops.aten.relu.default(getitem_1)
...
# For inline assertion. Note that `getitem` which is an output of `fused_1`, is consumed by it.
select_int: i64[] = torch.ops.aten.select.int(getitem, 0, 0)
eq_scalar: b8[] = torch.ops.aten.eq.Scalar(select_int, 5)
dep_token2: f32[] = torch.ops.aten._functional_assert_async.msg(
eq_scalar, 'assertion error', dep_token = dep_token1
)
...
getitem_1: i64[s0] = fused_1[1] # `getitem_1` is actually `add_2`
fused_0: i64[s0] = self.fused_0(getitem_1, relu_default)
...
return (fused_0, getitem_1, dep_token2)
class fused_0(torch.nn.Module):
def forward(self, add_tensor_2: i64[s0], relu_default: i64[s0]):
... # Logic after relu
return add_tensor_4
class fused_1(torch.nn.Module):
def forward(self, arg0_1: i64[s0], arg1_1: i64[s0]):
... # Logic before relu
# `add_tensor_1` (basically `add_1`) is returned to allow downstream assertion op consumes it.
return (add_tensor_1, add_tensor_2)
```
As shown above, the extra assertion (regardless of whether it's functionalized or not) **won't** cause extra submodule breakage if the asserted node is an intermediate node within the submodule - here the intermediate node is returned as an extra output of the submodule so the downstream assertion node can consume it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104287
Approved by: https://github.com/tugsbayasgalan
During revert, use title of "Meta Internal-Only Changes Check" to determine whether or not internal diff is associated with the PR. When PR is merged/closed, "Meta Internal-Only Changes Check" status is always success, but title message can differ:
- "There is no internal Diff connected, this can be merged now" means that there are no internal change associated with PR (or it was landed via GitHub First workflow)
- "The internal Diff has landed, this can be merged now" meaning that PR has associated internal DIFF, and OSS and internal reverts must happen in sync using internal tooling. (Or a revert PR can be authored in OSS)
Add regression test for https://github.com/pytorch/pytorch/pull/100652 that was originated from the internal diff, but was merged as OSS PR.
Fixes https://github.com/pytorch/pytorch/issues/104232
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104344
Approved by: https://github.com/bigfootjon, https://github.com/huydhn
Fix https://github.com/pytorch/pytorch/issues/99639 by handling the case in `InliningInstructionTranslator`'s `LOAD_CLOSURE` definition when the requested cell is not in `self.closure_cells`.
My intuition is that the behavior of `LOAD_DEREF` and `STORE_DEREF` on a cell/freevar should not depend on whether or not we called `LOAD_CLOSURE` (that is, we shouldn't create a new cell var in `LOAD_CLOSURE` like in https://github.com/pytorch/pytorch/pull/101357). But we need a way to push cells created by the inlined function that were not present in the caller - `InlinedClosureVariable` is used to differentiate these cells from other cells.
Adding this test causes an error though (EDIT: this test is not relevant to this PR and instead just reveals that `cond` with Python side effects is still broken):
```python
def test_closure_out_of_scope_cell_with_cond(self):
    from functorch.experimental.control_flow import cond
    cell1 = torch.rand(3, 3)
    cell2 = torch.rand(3, 3)
    orig3 = torch.rand(3, 3)

    def test(x):
        cell3 = orig3.clone()

        def then():
            nonlocal cell3
            cell3 += cell1
            return cell3

        def els():
            nonlocal cell3
            cell3 += cell2
            return cell3

        return cond(x > 0, then, els, [])

    opt_fn = torch._dynamo.optimize("eager")(test)
    result1 = opt_fn(1)
    self.assertTrue(torch.allclose(result1, orig3 + cell1))
    result2 = opt_fn(-1)
    self.assertTrue(torch.allclose(result1, orig3 + cell1 + cell2))
```
```
Traceback (most recent call last):
File "/scratch/williamwen/work/pytorch2/test/dynamo/test_misc.py", line 1768, in test_closure_out_of_scope_cell_with_cond
result1 = opt_fn(1)
File "/scratch/williamwen/work/pytorch2/torch/_dynamo/eval_frame.py", line 295, in _fn
return fn(*args, **kwargs)
File "/scratch/williamwen/work/pytorch2/torch/_dynamo/eval_frame.py", line 448, in catch_errors
return callback(frame, cache_size, hooks, frame_state)
File "/scratch/williamwen/work/pytorch2/torch/_dynamo/convert_frame.py", line 526, in _convert_frame
result = inner_convert(frame, cache_size, hooks, frame_state)
File "/scratch/williamwen/work/pytorch2/torch/_dynamo/convert_frame.py", line 127, in _fn
return fn(*args, **kwargs)
File "/scratch/williamwen/work/pytorch2/torch/_dynamo/convert_frame.py", line 360, in _convert_frame_assert
return _compile(
File "/scratch/williamwen/work/pytorch2/torch/_dynamo/utils.py", line 180, in time_wrapper
r = func(*args, **kwargs)
File "/scratch/williamwen/work/pytorch2/torch/_dynamo/convert_frame.py", line 430, in _compile
out_code = transform_code_object(code, transform)
File "/scratch/williamwen/work/pytorch2/torch/_dynamo/bytecode_transformation.py", line 1000, in transform_code_object
transformations(instructions, code_options)
File "/scratch/williamwen/work/pytorch2/torch/_dynamo/convert_frame.py", line 415, in transform
tracer.run()
File "/scratch/williamwen/work/pytorch2/torch/_dynamo/symbolic_convert.py", line 2029, in run
super().run()
File "/scratch/williamwen/work/pytorch2/torch/_dynamo/symbolic_convert.py", line 708, in run
and self.step()
File "/scratch/williamwen/work/pytorch2/torch/_dynamo/symbolic_convert.py", line 668, in step
getattr(self, inst.opname)(inst)
File "/scratch/williamwen/work/pytorch2/torch/_dynamo/symbolic_convert.py", line 391, in wrapper
return inner_fn(self, inst)
File "/scratch/williamwen/work/pytorch2/torch/_dynamo/symbolic_convert.py", line 1100, in CALL_FUNCTION
self.call_function(fn, args, {})
File "/scratch/williamwen/work/pytorch2/torch/_dynamo/symbolic_convert.py", line 559, in call_function
self.push(fn.call_function(self, args, kwargs))
File "/scratch/williamwen/work/pytorch2/torch/_dynamo/variables/torch.py", line 1061, in call_function
(false_r, false_graph, false_lifted_freevars) = speculate_branch(False)
File "/scratch/williamwen/work/pytorch2/torch/_dynamo/variables/torch.py", line 1044, in speculate_branch
ret_val, ret_graph, ret_lifted_freevars = speculate_subgraph(
File "/scratch/williamwen/work/pytorch2/torch/_dynamo/variables/torch.py", line 850, in speculate_subgraph
output = f.call_function(tx, args, {})
File "/scratch/williamwen/work/pytorch2/torch/_dynamo/variables/functions.py", line 121, in call_function
return tx.inline_user_function_return(
File "/scratch/williamwen/work/pytorch2/torch/_dynamo/symbolic_convert.py", line 595, in inline_user_function_return
result = InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
File "/scratch/williamwen/work/pytorch2/torch/_dynamo/symbolic_convert.py", line 2134, in inline_call
return cls.inline_call_(parent, func, args, kwargs)
File "/scratch/williamwen/work/pytorch2/torch/_dynamo/symbolic_convert.py", line 2231, in inline_call_
tracer.run()
File "/scratch/williamwen/work/pytorch2/torch/_dynamo/symbolic_convert.py", line 708, in run
and self.step()
File "/scratch/williamwen/work/pytorch2/torch/_dynamo/symbolic_convert.py", line 668, in step
getattr(self, inst.opname)(inst)
File "/scratch/williamwen/work/pytorch2/torch/_dynamo/symbolic_convert.py", line 162, in impl
self.push(fn_var.call_function(self, self.popn(nargs), {}))
File "/scratch/williamwen/work/pytorch2/torch/_dynamo/variables/builtin.py", line 497, in call_function
proxy = tx.output.create_proxy(
File "/scratch/williamwen/work/pytorch2/torch/_dynamo/output_graph.py", line 345, in create_proxy
return self.current_tracer.create_proxy(*args, **kwargs)
File "/scratch/williamwen/work/pytorch2/torch/_dynamo/output_graph.py", line 1109, in create_proxy
new_arg = self.lift_tracked_freevar_to_input(arg)
File "/scratch/williamwen/work/pytorch2/torch/_dynamo/output_graph.py", line 1226, in lift_tracked_freevar_to_input
self.parent.lift_tracked_freevar_to_input(proxy)
File "/scratch/williamwen/work/pytorch2/torch/_dynamo/output_graph.py", line 1219, in lift_tracked_freevar_to_input
assert (
AssertionError: lift_tracked_freevar_to_input on root SubgraphTracer
from user code:
File "/scratch/williamwen/work/pytorch2/test/dynamo/test_misc.py", line 1766, in test
return cond(x > 0, then, els, [])
File "/scratch/williamwen/work/pytorch2/test/dynamo/test_misc.py", line 1764, in els
cell3 += cell2
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104222
Approved by: https://github.com/jansel
This PR handles inference. Will do similar thing for training later.
Some manual testing results show this can improve inference perf by 2-3% (absolute improvement, not relative).
- convmixer: 4.285x -> 4.309x
- resnet50: 2.170x -> 2.203x
The PR is built upon freezing: without freezing, the weight input for a conv node may not be a parameter directly but the output of precision-converting ops, so it's much easier to implement this PR after freezing.
Commands
```
TORCHINDUCTOR_FREEZING=1 python benchmarks/dynamo/timm_models.py --backend inductor --amp --performance --only convmixer_768_32 --inference
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103642
Approved by: https://github.com/eellison
Summary: This diff also makes many things const and rearranges `X >= lb && X <= ub` to be `lb <= X && X <= ub`.
Test Plan:
```
buck2 build mode/dev-nosan fbcode//caffe2:ATen-cu
```
Differential Revision: D46943299
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104054
Approved by: https://github.com/xw285cornell
When working with highly dynamic Python code it's not always possible to express static types. However, if we consider the end-user experience for somebody who uses both PyTorch and a static type checker (mypy, pyright), we should err on the side of being ergonomic rather than technically correct.
`nn.Module.__getattr__` is one such example: on paper the return type is correct. In practice the community would benefit from having `Any` as the return type because it would avoid littering idiomatic PyTorch code with `cast`, `# type: ignore`, `assert`, `isinstance`, etc.
Some evidences:
- linked in the comment thread on pyright bug tracker https://github.com/microsoft/pyright/issues/4213
- the `pyre` type checker steps outside of normal type-checking practices and special-cases `register_buffer()` in part to avoid this problem: https://pyre-check.org/docs/features/. This is not a very scalable solution since type checkers generally aim at adhering to the spec (various typing PEPs).
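A small illustration of the ergonomics issue (assuming a strict checker such as mypy or pyright): with a non-`Any` return type on `nn.Module.__getattr__`, idiomatic attribute access needs casts or ignores.
```python
from typing import cast

import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.register_buffer("scale", torch.ones(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # With __getattr__ typed as Tensor | Module, `self.scale * x` is flagged;
        # the usual workarounds are cast(), assert isinstance(), or # type: ignore.
        scale = cast(torch.Tensor, self.scale)
        return x * scale
```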
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104321
Approved by: https://github.com/kit1980, https://github.com/albanD
In anticipation of adding some enhancements to the cuDNN benchmark cache (e.g., LRU eviction for memory savings), this PR adds some safety improvements to the handling of cache keys.
Currently, cache keys are dangerous to use, as e.g., a single inadvertent pass-by-value will potentially instantiate a copy with uninitialized padding bytes that will unwittingly hash differently and compare as unequal. This behavior is the result of `ParamsHash` using raw-bytes for hashing and comparison. I've been bitten by this in the past and would like to hopefully eliminate this class of errors.
Additionally, I'm not sure the standard guarantees that default copy/move constructors copy structs byte-for-byte, and this could be problematic when using maps as insertion could call these default constructors in order to instantiate a `std::pair`. Someone knowledgeable in C++ can correct me on this, but it seems that we are potentially relying on the good graces of common compiler implementations rather than the actual standard here.
This PR adds a variant of `ParamsHash` that expects a wrapped POD that has custom byte-for-byte constructors. It modifies the cuDNN V8 API benchmark cache to use this variant, and replaces the `setCacheKey` style code with constructor usage. If this approach looks good to folks I will also port other `ParamsHash` usage (e.g., in cuDNN v7 and other backends) and we can remove `ParamsHash`.
CC @malfet
@ngimel (who originally wanted constructors for keys, but I didn't have this solution in mind at the time)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104122
Approved by: https://github.com/zasdfgbnm, https://github.com/colesbury
Fixes #103613.
A requirement for HigherOrderOperators is that after Dynamo capture, the body
function should be functional (i.e. has no observable side effects).
If the body function mutates a variable that is not local to the body, then
that should induce a graph break.
This PR distinguishes between MutableLocals created inside/outside the body
and adds relevant checks. (Design originally proposed by voznesenskym.)
- We tag each mutable_local with an id that corresponds to where it came
from. The mutable_local may represent an existing object that gets
tracked by Dynamo or an object that is created while Dynamo is
introspecting.
- This id changes when we are introspecting the body of a HigherOrderOperator.
- If Dynamo wants to perform a side effect using a mutable_local, we
check its .scope field with the current scope id and raise Unsupported
in the desired case (non-local mutation inside HigherOrderOperator body)
- The id is a global thread_local variable. I can make this not a global
variable, but it just takes some engineering time to thread a number through
each of the various ways Dynamo can construct a mutable_local.
Test Plan:
- Add a bunch of new tests. Tests combinations of {global, nonlocal} x
{number, Tensor, list, object, nn.Module} and asserts that HigherOrderOp
falls back on those cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104077
Approved by: https://github.com/voznesenskym, https://github.com/jansel
Since we do not call `_FSDPState.__init__()` and only use it for typing, it is not possible for these attributes to be `None`. The purpose of these `assert`s is to make sure that these attributes are set by `_init_process_group_state_for_hybrid_shard()`. If we care to make that explicit, I would posit that we should be using `hasattr` checks, not `is not None` checks, because if indeed `_init_process_group_state_for_hybrid_shard()` did not set these attributes, then even checking that it is not `None` would lead to an `AttributeError`. I do not include these `hasattr` checks for now since `_init_process_group_state_for_hybrid_shard()` is short enough that we can quickly tell by inspection that it sets the desired attributes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104274
Approved by: https://github.com/rohan-varma
This checks that `ignored_modules` and `ignored_states` have the expected type and provides a reasonable error message if not. Otherwise, if someone passes a mix of modules and parameters to `ignored_states` for example, then our code may be silently incorrect.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104273
Approved by: https://github.com/rohan-varma
This fixes https://github.com/pytorch/pytorch/issues/104148 (unfreezing parameters after `n` steps).
- This fixes a bug where we did not delete the post-backward hook state properly for the `requires_grad=False` case.
- This makes the `already_resharded` correct for `SHARD_GRAD_OP`.
- This generalizes `_clear_grads_if_needed()` to `_reset_flat_param_grad_info_if_needed()` to additionally include propagating the original parameters' `requires_grad` to the flat parameter.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104186
Approved by: https://github.com/rohan-varma, https://github.com/fegin
Mark destructors as overrides, which fixes:
```cpp
In file included from /Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/metal/MetalPrepackOpRegister.cpp:3:
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/metal/MetalPrepackOpContext.h:52:3: warning: '~Conv2dOpContext' overrides a destructor but is not marked 'override' [-Winconsistent-missing-destructor-override]
~Conv2dOpContext() {
^
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/core/ivalue.h:22:17: note: overridden virtual function is here
class TORCH_API CustomClassHolder : public c10::intrusive_ptr_target {};
^
In file included from /Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/metal/MetalPrepackOpRegister.cpp:3:
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/metal/MetalPrepackOpContext.h:147:3: warning: '~LinearOpContext' overrides a destructor but is not marked 'override' [-Winconsistent-missing-destructor-override]
~LinearOpContext() {
^
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/core/ivalue.h:22:17: note: overridden virtual function is here
class TORCH_API CustomClassHolder : public c10::intrusive_ptr_target {};
```
Modernize constructors by passing parameters by value and moving them, rather than by reference, see [clang-tidy pass-by-value rule](https://clang.llvm.org/extra/clang-tidy/checks/modernize/pass-by-value.html).
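A minimal sketch of both changes on an illustrative class (not the actual Metal op-context types):
```cpp
#include <string>
#include <utility>

struct Base {
  virtual ~Base() = default;
};

struct Derived : Base {
  // Marking the destructor `override` silences
  // -Winconsistent-missing-destructor-override.
  ~Derived() override = default;

  // Pass parameters by value and move them into members
  // (clang-tidy modernize-pass-by-value) instead of taking a const reference
  // and copying.
  explicit Derived(std::string name) : name_(std::move(name)) {}

 private:
  std::string name_;
};

int main() {
  Derived d("conv2d_context");
  (void)d;
}
```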
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104312
Approved by: https://github.com/kit1980, https://github.com/osalpekar
In some cases, a UserFunctionVariable can be constructed when the underlying function should actually be treated as a TorchVariable. One example is when an attribute on an UnspecializedNNModuleVariable is a torch function. In those cases, we should treat the UserFunctionVariable as a TorchVariable.
This adds a check in UserDefinedObjectVariable.var_getattr() to try to create a TorchVariable instead of a UserFunctionVariable.
Fixes#104172
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104231
Approved by: https://github.com/williamwen42, https://github.com/jansel
Enable more tests on ASAN; meanwhile, we disable float-divide-by-zero and float-cast-overflow, both of which are also disabled by default in the latest clang.
The following cited doc explains the reasons.
```
-fsanitize=float-cast-overflow: Conversion to, from, or between floating-point types
which would overflow the destination. Because the range of representable values
for all floating-point types supported by Clang is [-inf, +inf], the only cases detected are
conversions from floating point to integer types.
-fsanitize=float-divide-by-zero: Floating point division by zero.
This is undefined per the C and C++ standards,
but is defined by Clang (and by ISO/IEC/IEEE 60559 / IEEE 754) as producing
either an infinity or NaN value,
so is not included in -fsanitize=undefined.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103647
Approved by: https://github.com/kit1980
The neighbor values we try for a field can be empty in some corner cases.
```
# E.g., if XBLOCK is 1 initially and size_hint for x is also 1.
# We would not try either larger or smaller XBLOCK in this case.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104293
Approved by: https://github.com/jansel
This is not ready for review; it is to make sure ASAN is fixed.
Not sure yet what the most effective way is to track down the bad dec_ref within deploy.
The asan silencing is done to match this comment:
1c79003b3c/test/test_cpp_extensions_jit.py (L749-L752)
EDIT: since the final failing function is in libtorch_python.so, we would need to skip that whole lib (not ok). So now we're skipping based on the function name which should be restrictive enough to not hide any real bug.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103989
Approved by: https://github.com/malfet
Summary:
The planned e2e for quantization in pytorch 2.0 export is the following:
float_model -> prepare_pt2e -> calibration -> convert_pt2e -> ...
inside convert_pt2e, we will first produce a q/dq representation of the quantized model, similar to the previous output of
convert_to_reference_fx in FX graph mode quantization:
```
torch.ops.quantized_decomposed.dequantize_per_tensor -> torch.ops.aten.add -> torch.ops.quantized_decomposed.quantize_per_tensor
torch.ops.quantized_decomposed.dequantize_per_tensor /
```
Then we'll rewrite the above to a representation that expresses the intention more precisely: here we actually want to do int8 addition
instead of simulating it with fp32 operations. The representation for
quantized add is:
```
def quantized_add(x_i8, x_scale, x_zero_point, y_i8, y_scale, y_zero_point, out_scale, out_zero_point):
x = (x_scale / out_scale) * x_i8
y = (y_scale / out_scale) * y_i8
out = x + y
out -= (x_zero_point * x_scale - y_zero_point * y_scale) / out_scale
out += out_zero_point
return out
```
Test Plan:
```
buck2 test caffe2/test:quantization_pt2e -- --exact 'caffe2/test:quantization_pt2e - test_representation_add (quantization.pt2e.test_quantize_pt2e.TestQuantizePT2E)'
```
Reviewed By: kimishpatel
Differential Revision: D45628032
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104130
Approved by: https://github.com/kimishpatel
This PR adds in support for semi-structured sparsity via a tensor
subclass. It currently uses the CUTLASS kernels merged in PR #100881.
In the future we plan to add in cuSPARSELt support (see the other PRs in
the stack), which will give us larger performance gains.
This PR adds in 2 things:
- a Tensor subclass, `SparseSemiStructuredTensor`, to store the
sparse tensor in compressed form and override `__torch_dispatch__`.
- a conversion function that takes in a dense tensor and a
semi-structured sparse bool mask and creates an instance of the
subclass.
**SparseSemiStructuredTensor**
The subclass stores the dense tensor in a contiguous flattened tensor
for future compatibility with cuSPARSELt, which expects this format.
Note that the CUTLASS kernels do not have this limitation, as the
specified values and the metadata are passed separately in
`_structured_sparse_linear`. In the future we can use the cuSPARSELt bindings
[here](https://github.com/pytorch/pytorch/pull/103700) for faster matmul, better dtype coverage, and relaxed shape
constraints.
Since we currently don't have a way to go back from the sparse
representation to the dense representation, and we store the weights in
compressed form, we don't have a great way to handle .t().
Instead, we keep track of how often we've called transpose on our
tensor, and if it's an unexpected number we throw an error. When the first
argument is sparse, we expect an even number of calls to transpose,
while when the second argument is sparse, we expect an odd number of
calls. This is because we support second argument sparse matrix
multiplications by using transpose properties.
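The transpose trick relies on the identity A @ B == (B.t() @ A.t()).t(); a quick dense-tensor sanity check of that identity (plain PyTorch, no sparsity involved):
```python
import torch

a = torch.randn(8, 16)
b = torch.randn(16, 4)

# mm(dense, sparse) can be rewritten so the sparse operand comes first:
# A @ B == (B.t() @ A.t()).t()
lhs = a @ b
rhs = (b.t() @ a.t()).t()
print(torch.allclose(lhs, rhs, atol=1e-6))  # True
```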
**to_sparse_semi_structured**
This is a conversion function to convert a dense tensor and a
semi-structured sparse bool mask into a subclass. Currently, we must
pass in a bool mask, since we can't infer it because there may be
additional zero elements in the dense tensor, so `tensor != 0` is not 2:4
sparse.
Once we add either a method to derive the mask from the dense tensor or
cuSPARSELt, we no longer need to pass in the mask. cuSPARSELt has its
own helper functions to create the metadata mask.
**User Details**
We have implemented support for the following ops for `torch.float16`
and `torch.int8`:
```
torch.addmm(bias, dense, sparse.t())
torch.mm(dense, sparse)
torch.mm(sparse, dense)
aten.linear.default
aten.t.default
aten.t.detach
```
The end user interface to accelerate an nn.Linear module with the
subclass would look like this:
```
from torch.sparse import to_sparse_semi_structured
mask = torch.Tensor([0, 0, 1, 1]).tile(128, 32).cuda().bool()
linear = Model(128, 128).half().cuda()
linear.weight = nn.Parameter(to_sparse_semi_structured(linear.weight,
    mask=linear.weight.bool()))
```
This also updates tests and the `torch.sparse` module docstring to
reflect these changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102135
Approved by: https://github.com/albanD
Summary:
## What is this?
This is a giant codemod to migrate all of fbcode from the tp2 version of gtest to the `fbsource/third-party` version.
## Why?
Various parts of the monorepo use different versions of gtest which are incompatible with each other and make maintenance of C++ testing more difficult than it should be. There also doesn't seem to be much reason for this fragmentation. Shifting all `gtest` dependencies towards `fbsource/third-party` is a big step in the right direction towards cleaning this up.
Also -- tp2 is deprecated, so we want to stop using that anyway. If we're going to make improvements to `gtest`, we should get away from tp2 as a first step.
## How?
I used bash script to perform the majority of the codemod: P777150295
I followed up with `rg` to find additional dependencies, then simply iterated a ton until CI was (mostly) happy.
This diff also includes an update to autodeps to use the `third-party/fbsource` version of gtest rather than the `tp2` version.
#forcetdhashing
Test Plan: CI
Differential Revision: D46961576
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104255
Approved by: https://github.com/huydhn
# Change
This PR adds two classes to DTensor:
1. `CudaRNGStateTracker`: `CudaRNGStateTracker` stores Random Number Generator (RNG) state (a `ByteTensor` object) in a `dict`, mapping a corresponding tag to each state tensor. It also provides a set of convenient utility methods to help access/modify the state tensors. The most important interface is `_distribute_region`, which will be used when DTensor executes a random op (an operator that calls RNG); a conceptual sketch follows the list below.
2. `OffsetBasedRNGTracker`: This subclass of `CudaRNGStateTracker` defines the default policy of how RNG states should be shared and synchronized among all ranks to respect the semantics of DTensor random operators.
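A rough, self-contained sketch of the tracker shape described in (1) above (illustrative only; DTensor's actual implementation differs):
```python
import contextlib
import torch

class RNGStateTrackerSketch:
    """Maps a tag to a saved CUDA RNG state tensor (a ByteTensor)."""

    def __init__(self):
        self._states = {}  # tag -> saved RNG state

    def save(self, tag):
        self._states[tag] = torch.cuda.get_rng_state()

    def get(self, tag):
        return self._states[tag]

    @contextlib.contextmanager
    def _distribute_region(self, tag):
        # Swap in the shared state before running a random op, then save the
        # advanced state afterwards so every rank stays in sync.
        torch.cuda.set_rng_state(self._states[tag])
        try:
            yield
        finally:
            self._states[tag] = torch.cuda.get_rng_state()
```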
# Warning
- With `Multi-threaded ProcessGroup`, the global variable `_rng_tracker` will be shared among threads (ranks) and cause issues. We need to figure out a compatible solution for that.
- The RNG state may be out of sync outside of the participating ranks. It is harmless in our current use case of submesh though.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103235
Approved by: https://github.com/wanchaol
This PR integrated the assertion functionalization logic into current export logic.
**NOTE:**
I finally decided to do the assertion functionalization after AOT export instead of before for the following reasons:
* The benefit of AOT export is that the graph is already functionalized, so things like method calls are already transformed to function calls. However, if we do it before AOT export, the graph is still at the torch level and extra logic like bab21d20eb/torch/_export/pass_base.py (L201-L204C17) would need to be implemented.
* The graph signature is currently already somewhat incorrect after adding runtime assertions (this doesn't seem to break logic since we already depend on positions instead of FQNs of outputs). This PR also fixes this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103887
Approved by: https://github.com/avikchaudhuri, https://github.com/tugsbayasgalan
Summary:
Also adds support for backend_config with relu fusion since XNNPACK allows it.
We should revisit the relu fusion once we gain more clarity on quantSrcPartition or some other way to do these fusions without having to add all combinations.
We should really rename the backend config to et_xnnpack.py or something (TODO).
Test Plan: `buck test fbcode//mode/dev-nosan fbcode//executorch/backends/xnnpack/test:`
Differential Revision: D46985169
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104134
Approved by: https://github.com/mcr229, https://github.com/salilsdesai
Dispatch the selection function to prevent using `is_mps()` in `Histogram.cpp`.
### <samp>🤖 Generated by Copilot at b329a02</samp>
This pull request refactors and implements the logic for inferring the bin edges of histograms from the input tensor for different device types. It introduces a dispatch stub `histogram_select_outer_bin_edges_stub` and moves the device-specific code to separate files, such as `HistogramKernel.cpp` and `HistogramKernel.mm`. This improves the modularity and readability of the histogram functions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101792
Approved by: https://github.com/albanD
Apart from introducing MPSProfiler, this PR also
1. removes the synchronization call after all the commands are encoded, since the stream will be synchronized when the next graph op is encountered and run. One can take a look at this [PR](https://github.com/pytorch/pytorch/pull/99810) to get some insight.
2. initializes the offset calculation kernel's thread output with 0 to ensure the subsequent offset accumulation is correct. This change aligns the kernel with the `kernel_index_offsets` kernel.
### <samp>🤖 Generated by Copilot at 4094984</samp>
This change enables performance analysis of the `histogram` kernel on MPS devices by using the `MPSProfiler` class to collect and report relevant metrics. It modifies the file `HistogramKernel.mm` to add profiling calls around the kernel execution.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101692
Approved by: https://github.com/albanD
Prevents the following cryptic error if one attempts to use `run_test.py` on a system that also has torchaudio installed in dev mode (as `tools` from https://github.com/pytorch/audio might take precedence, but this is not how the script should behave):
```
Unable to import test_selections from tools/testing. Running without test selection stats.... Reason: No module named 'tools.stats'
Traceback (most recent call last):
File "/Users/nshulga/git/pytorch/pytorch/test/run_test.py", line 1673, in <module>
main()
File "/Users/nshulga/git/pytorch/pytorch/test/run_test.py", line 1604, in main
selected_tests = get_selected_tests(options)
File "/Users/nshulga/git/pytorch/pytorch/test/run_test.py", line 1418, in get_selected_tests
path = os.path.join(str(REPO_ROOT), TEST_TIMES_FILE)
NameError: name 'TEST_TIMES_FILE' is not defined
```
But make sure to remove it in the end; otherwise it will not work if torch is installed from a wheel but tests are run from a clean repo checkout.
### <samp>🤖 Generated by Copilot at dd52521</samp>
> _Sing, O Muse, of the cunning code review_
> _That fixed the tests of the `tools` module_
> _By adding and removing the root path_
> _As a shepherd guides his flock to and fro._
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104214
Approved by: https://github.com/kit1980
Based on this [code search](https://fburl.com/code/gjcnw8ly) (*.yaml with `dispatch: CPU:`), update all files found to use
```
kernels:
- arg_meta: None
kernel_name:
```
instead of
```
dispatch:
CPU:
```
---
## Code changes:
- `fbcode/executorch/codegen/tools/gen_oplist.py`
- Strip ET specific fields prior to calling parse_native_yaml_struct
---
## Files edited that are not `*functions.yaml` or `custom_ops.yaml`
- fbcode/executorch/kernels/optimized/optimized.yaml
- fbcode/executorch/kernels/quantized/quantized.yaml
- fbcode/executorch/kernels/test/custom_kernel_example/my_functions.yaml
---
## Found Files that were not edited
**Dispatched to more than just CPU**
- fbcode/caffe2/aten/src/ATen/native/native_functions.yaml
- xplat/caffe2/aten/src/ATen/native/native_functions.yaml
- xros/third-party/caffe2/caffe2/aten/src/ATen/native/native_functions.yaml
**Grouped ops.yaml path**
- fbcode/on_device_ai/Assistant/Jarvis/min_runtime/operators/ops.yaml
---
**Design Doc:** https://docs.google.com/document/d/1gq4Wz2R6verKJ2EFseLyPdAF0wqomnCrVDDJpRkYsRw/edit?kh_source=GDOCS#heading=h.8raqyft9y50
Differential Revision: [D46952067](https://our.internmc.facebook.com/intern/diff/D46952067/)
**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D46952067/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104070
Approved by: https://github.com/larryliu0820
Note that in general it's not good form to try to make FakePG work with 'real data', but the reasoning here is that we want FakePG to work with DeviceMesh's init code, which has data validation, and that makes it worth the tradeoff.
In general, users should use MTPG or a normal PG for cases where they care about real data from collectives.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104213
Approved by: https://github.com/wconstab, https://github.com/voznesenskym
Mostly a refactor that moves all the tests from `test_cuda` that benefit from a multi-GPU environment into their own file.
- Add `TestCudaMallocAsync` class for Async tests ( to separate them from `TestCudaComm`)
- Move individual tests from `TestCuda` to `TestCudaMultiGPU`
- Move `_create_scaling_models_optimizers` and `_create_scaling_case` to `torch.testing._internal.common_cuda`
- Add newly created `test_cuda_multigpu` to the multigpu periodic test
### <samp>🤖 Generated by Copilot at f4d46fa</samp>
This pull request fixes a flaky test and improves the testing of gradient scaling on multiple GPUs. It adds verbose output for two CUDA tests, and refactors some common code into helper functions in `torch/testing/_internal/common_cuda.py`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104059
Approved by: https://github.com/huydhn
https://github.com/pytorch/pytorch/pull/95715 added the functionality to abort `ncclCommInitRankConfig` by specifying `blocking=0` to enable non-blocking behavior.
However, calling the `pg._abort()` didn't recover from a stuck `ncclCommInitRankConfig` since the `_abort` method only looked through `devNCCLCommMap_` map and aborted those communicators. Since `ncclCommInitRankConfig` was stuck, the communicator itself wasn't added to the map and the host thread was stuck on this line: https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L1171. As a result, `_abort` was a no-op.
To resolve this issue, I added the communicators to `inProgressCommMap_` as soon as they were created and then removed them once added to `devNCCLCommMap_`.
I also added a unit test that was failing without the changes to ProcessGroupNCCL.cpp
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103925
Approved by: https://github.com/osalpekar
Summary:
Details in T133020932
First commit of collective utils library. Ported over from model store, removed scuba logging, error_trait and all dependencies on modelstore.
Test Plan: In the following diffs.
Differential Revision: D45545970
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101037
Approved by: https://github.com/H-Huang
Summary:
Trying to get the `__self__` attribute on any `_OpNamespace` object should be an invalid operation. The `__self__` attribute only exists on instance method objects and not on class objects.
In [dynamo](a152b3e3b8/torch/_dynamo/variables/torch.py (L164)) there is code that tries to access the `__self__` attribute on `TorchVariable`, this currently results in an expensive call to `torch._C._jit_get_operation` [here](a152b3e3b8/torch/_ops.py (L740)) which ultimately fails and throws an exception. For cases where it fails the operation turns out to be quite expensive on the order of ~0.03s.
For edge use cases, when exporting large models with quantized ops, this exception is thrown hundreds of times, resulting in a lot of wasted time. By preventing the call to `torch._C._jit_get_operation`, we can quickly return from this function and significantly reduce export times. On a large ASR model, for example, export currently takes **~405** seconds. With this change we can reduce it to **~340s**.
Overall this should also be a harmless change, as no one should ever really try to access the `__self__` attribute on an `_OpNamespace` object.
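A minimal sketch of this kind of short-circuit on a toy namespace class (illustrative; the real guard lives in `_OpNamespace.__getattr__`):
```python
class OpNamespaceSketch:
    """Toy namespace that resolves attribute accesses to operator lookups."""

    def __init__(self, name):
        self.name = name

    def __getattr__(self, op_name):
        # '__self__' only exists on bound-method objects, never on a
        # namespace, so bail out before attempting the expensive lookup.
        if op_name == "__self__":
            raise AttributeError(
                f"'{self.name}' namespace has no attribute '__self__'")
        # ... expensive operator resolution would happen here ...
        raise AttributeError(f"unknown op '{op_name}'")

ns = OpNamespaceSketch("quantized_decomposed")
print(getattr(ns, "__self__", None))  # None, returned quickly
```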
Test Plan: Added test case.
Differential Revision: D46959879
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104096
Approved by: https://github.com/larryliu0820, https://github.com/ezyang, https://github.com/zou3519
This PR combines the C++ code for the AOTInductor's model and interface with Bin Bao's changes to AOTInductor codegen.
It adds a number of AOTInductor C interfaces that can be used by an inference runtime. Under the hood of the interfaces, the model code generated by the AOTInductor's codegen is wrapped into a class, AOTInductorModel, which manages tensors and runs the model inference.
On top of AOTInductorModel, we provide one more abstract layer, AOTInductorModelContainer, which allows the user to have multiple inference runs concurrently for the same model.
This PR also adjusts the compilation options for AOT codegen, particularly some fbcode-related changes such as libs to be linked and header-file search paths.
Note that this is the very first version of the AOTInductor model and interface, so many features (e.g. dynamic shape) are incomplete. We will support those missing features in future PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104202
Approved by: https://github.com/desertfire
Fixes#104170
As noted in the above issue it seems that the code for randperm basically boils down to:
`torch.argsort(torch.rand(size, device="mps"), dim = 0)`
However, it seems like in the fused(?) PyTorch version, the tensor we were drawing `torch.rand(size, device="mps")` from was int64 with an inclusive(?) upper bound of 1. This caused everything to be sorted into two groups (depending on whether you drew 0 or 1), each monotonically ascending due to sort tie breaking.
One way to fix this is to just generate the random tensor as float64s with an upper bound of 1.0 instead of int64s. An alternative is to just set the upper bound to the int64 max value.
~I choose the float64 one basically on a coin flip b/c I couldn't tell the original contributor's intent (due to mixed up upper bounds and type) but would be happy to change to use int64 and max int 64 as an upper bound instead if that's better.~
Edit: on second thought, I don't like using floats from 0.0 to 1.0, as there are fewer of them in that range than int64s from 0 to the int64 max value. I also suspect integer math might be faster, but I need to benchmark this tomorrow.
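A quick CPU illustration of why the key range matters for the argsort-based approach (plain PyTorch, not the MPS kernel):
```python
import torch

n = 10

# Integer keys drawn only from {0, 1} produce massive ties, so the resulting
# "permutation" degenerates into two monotonically ascending runs.
bad_keys = torch.randint(0, 2, (n,), dtype=torch.int64)
print(torch.argsort(bad_keys, dim=0))

# Keys drawn from a wide range (or floats in [0, 1)) make ties vanishingly
# unlikely, so argsort yields a uniform-looking permutation.
good_keys = torch.randint(0, 2**62, (n,), dtype=torch.int64)
print(torch.argsort(good_keys, dim=0))
```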
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104171
Approved by: https://github.com/malfet
Summary:
According to https://www.internalfb.com/omh/view/ai_infra_mobile_platform/tests these have been failing since July 2022.
Just going to delete unless someone thinks they actually do matter and should be made green.
https://www.internalfb.com/intern/test/562949996115570/ <- failing test
I ran locally and got errors like
xplat/caffe2/aten/src/ATen/native/quantized/cpu/qnnpack/test/gemm-block-sparse-microkernel-tester.h:483: Failure
Expected equality of these values:
c[mIndex * cStride() + nIndex]
Which is: -872.50446
acc[mIndex * n() + nIndex]
Which is: -872.50488
at 0, 0: reference = -872.5048828125, optimized = -872.50445556640625, Mr x Nr = 8 x 4, M x N x K = 7 x 1 x 13
xplat/caffe2/aten/src/ATen/native/quantized/cpu/qnnpack/test/gemm-block-sparse-microkernel-tester.h:483: Failure
Expected equality of these values:
c[mIndex * cStride() + nIndex]
Which is: -67.246628
acc[mIndex * n() + nIndex]
Which is: -67.24707
at 3, 0: reference = -67.2470703125, optimized = -67.246627807617188, Mr x Nr = 8 x 4, M x N x K = 4 x 1 x 15
[ FAILED ] Q8GEMM_8x4c1x4__SSE2.packedA_k_gt_8_subtile (148 ms)
Test Plan: ci
Reviewed By: kimishpatel
Differential Revision: D46950966
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104073
Approved by: https://github.com/kimishpatel
Summary:
ETRecord can't use this yet because the other programs need to be migrated to using ExportedProgram (D46729844)
Note: higher order ops like call_delegate/cond are also not supported yet
Test Plan: `buck2 run @//mode/dev-nosan //executorch/exir/tests:serde`
Differential Revision: D46802454
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103763
Approved by: https://github.com/tarun292
When Dynamo sees `wrap(f, x)` and decides that `f` is unsafe, Dynamo
should fall back to eager mode and stop introspection throughout the
entire call of `f`. The motivation is:
- it's easier to test `wrap` this way (it is clearer how many graph
breaks should occur)
- Other HigherOrderOperator do this because their execution of the
body involves code that is not necessarily Dynamo-able. e.g. functorch
transforms. Since `wrap` is a test for the HigherOrderOp mechanism, it
should reflect what other HigherOrderOps do.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104076
Approved by: https://github.com/ydwu4
Summary:
Also adds support for backend_config with relu fusion since XNNPACK allows it.
We should revisit the relu fusion once we gain more clarity on quantSrcPartition or some other way to do these fusions without having to add all combinations.
We should really rename the backend config to et_xnnpack.py or something (TODO).
Test Plan: `buck test fbcode//mode/dev-nosan fbcode//executorch/backends/xnnpack/test:`
Differential Revision: D46924209
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104090
Approved by: https://github.com/mcr229
Adds Conv-BN folding to inductor freezing. One thing that's a little awkward now is that we'll want different decompositions to run depending on whether we are in the inference compiler. For now, I require that you run with torch.no_grad() so we can detect that no gradients are required before calling aot_autograd.
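A sketch of how this would be exercised; the `torch._inductor.config.freezing` flag name is my assumption of the relevant config knob, and the `torch.no_grad()` requirement comes from the description above:
```python
import torch
import torch._inductor.config as inductor_config

model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, kernel_size=3),
    torch.nn.BatchNorm2d(8),
    torch.nn.ReLU(),
).eval()

# Assumed config knob gating the freezing passes (name may differ).
inductor_config.freezing = True

compiled = torch.compile(model)

x = torch.randn(1, 3, 32, 32)
with torch.no_grad():  # required so no gradients are needed before aot_autograd
    out = compiled(x)
```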
Differential Revision: [](https://our.internmc.facebook.com/intern/diff/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100653
Approved by: https://github.com/jansel
Summary:
When we pickle/unpickle a graph module in multipy, we would lose modules/attributes that are not referred to in the graph. This is because when unpickling an fx graph module, we use the stored `__dict__` and the fx graph to create a new graph module, and in GraphModule init we drop any attribute that is not referred to in the graph.
This behavior is not ideal because we actually expect a graph module that is exactly the same after unpickling.
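A minimal repro sketch of the behavior being fixed (the attribute name is made up for illustration):
```python
import pickle
import torch
import torch.fx

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

gm = torch.fx.symbolic_trace(M())
gm.unused_buffer = torch.zeros(3)  # attribute not referenced by the graph

restored = pickle.loads(pickle.dumps(gm))
# Before this change the attribute was silently dropped; with it, the
# unpickled module is expected to carry the attribute along.
print(hasattr(restored, "unused_buffer"))
```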
Test Plan:
```
buck test mode/opt caffe2/test:fx -- test_preserve_unused_attr_after_unpickle
Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. Build failure 0
```
Differential Revision: D46976230
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104115
Approved by: https://github.com/houseroad
Fixes#103818
1. For some special nn.Modules, there are checks which only support CUDA, so I add a `privateuse1` check.
2. When getting the device type for `privateuse1` via `torch._C._get_privateuse1_backend_name()`, it raises an error under `torch.jit.script`, so I add a global variable to avoid this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103419
Approved by: https://github.com/albanD
Adds the unit tests requested in #95810
This PR also addresses a gap in unit testing of gradients, as `gradcheck` always performs total derivatives w.r.t. all arguments and module parameters. Some modules have different code paths for partial derivatives, e.g. `LayerNorm`, and those should be tested separately.
The PR has the following limitations:
- it does not test partial derivatives w.r.t. every combination of arguments, which would exponentially increase CI time.
- it does not implement the same logic for Hessians, where the increase in CI time would be quadratic in the number of arguments.
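A minimal sketch of what testing a partial derivative looks like with `gradcheck`, where only one input requires grad (illustrative, not the new tests themselves):
```python
import torch
from torch.autograd import gradcheck

# Only `x` requires grad, so gradcheck probes the partial derivative d(out)/dx
# while the weight is held fixed -- the kind of partial-derivative check
# described above.
x = torch.randn(4, 3, dtype=torch.double, requires_grad=True)
w = torch.randn(3, 3, dtype=torch.double, requires_grad=False)

def fn(x):
    return torch.nn.functional.linear(x, w)

print(gradcheck(fn, (x,)))  # True if analytical and numerical grads match
```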
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103809
Approved by: https://github.com/kit1980
Summary:
Currently, the cuBLASLt-based fused GELU epilogue in the GPU back-end of the `_addmm_activation` operator uses the tanh approximation, whereas other code paths on GPU don't.
With this PR, the GELU tanh approximation is switched on in all back-end code paths of `_addmm_activation` on GPU for better consistency.
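For context on the difference being unified, the two GELU flavors can be compared directly in eager mode (this is plain `torch.nn.functional.gelu`, not the fused epilogue):
```python
import torch
import torch.nn.functional as F

x = torch.randn(1024, dtype=torch.float32)

exact = F.gelu(x)                            # erf-based "exact" GELU
tanh_approx = F.gelu(x, approximate="tanh")  # tanh approximation

# The two agree closely but not bit-for-bit, which is why mixing them
# across code paths was inconsistent.
print((exact - tanh_approx).abs().max())
```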
Test Plan:
```
$ python test/test_linalg.py -k test_addmm_relu -v
test_addmm_relu_cpu_bfloat16 (__main__.TestLinalgCPU.test_addmm_relu_cpu_bfloat16) ... ok
test_addmm_relu_cpu_float32 (__main__.TestLinalgCPU.test_addmm_relu_cpu_float32) ... ok
test_addmm_relu_cpu_float64 (__main__.TestLinalgCPU.test_addmm_relu_cpu_float64) ... ok
test_addmm_relu_cuda_bfloat16 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_bfloat16) ... ok
test_addmm_relu_cuda_float32 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_float32) ... ok
test_addmm_relu_cuda_float64 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_float64) ... ok
----------------------------------------------------------------------
Ran 6 tests in 1.896s
OK
$ python test/test_linalg.py -k test_addmm_gelu -v
test_addmm_gelu_cpu_bfloat16 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_bfloat16) ... ok
test_addmm_gelu_cpu_float32 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_float32) ... ok
test_addmm_gelu_cpu_float64 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_float64) ... ok
test_addmm_gelu_cuda_bfloat16 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_bfloat16) ... ok
test_addmm_gelu_cuda_float32 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_float32) ... ok
test_addmm_gelu_cuda_float64 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_float64) ... ok
----------------------------------------------------------------------
Ran 6 tests in 2.050s
OK
```
Reviewers: @eellison
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104061
Approved by: https://github.com/eellison
Summary:
Expand using `aten::repeat` for all dims
[expand](https://pytorch.org/docs/stable/generated/torch.Tensor.expand.html#torch.Tensor.expand)
[expand_as](
https://pytorch.org/docs/stable/generated/torch.Tensor.expand_as.html)
Test Plan:
clang-format on `Expand.cpp`
expand tests:
```
lfq@lfq-mbp fbsource % buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 -- --gtest_filter="*.expand*"
Action graph will be rebuilt because files have been added or removed.
Parsing buck files: finished in 1.1 sec
Downloaded 5/50 artifacts, 661.18 Kbytes, 37.5% cache miss (for updated rules)
Building: finished in 15.4 sec (100%) 515/515 jobs, 15/515 updated
Total time: 16.9 sec
BUILD SUCCEEDED
Running main() from xplat/third-party/gmock/googletest-1.12.1/googletest/src/gtest_main.cc
Note: Google Test filter = *.expand*
[==========] Running 6 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 6 tests from VulkanAPITest
[ RUN ] VulkanAPITest.expand_exceptions
[ OK ] VulkanAPITest.expand_exceptions (66 ms)
[ RUN ] VulkanAPITest.expand_1d
[ OK ] VulkanAPITest.expand_1d (7 ms)
[ RUN ] VulkanAPITest.expand_2d
[ OK ] VulkanAPITest.expand_2d (2 ms)
[ RUN ] VulkanAPITest.expand_3d
[ OK ] VulkanAPITest.expand_3d (2 ms)
[ RUN ] VulkanAPITest.expand_4d
[ OK ] VulkanAPITest.expand_4d (4 ms)
[ RUN ] VulkanAPITest.expand_as
[ OK ] VulkanAPITest.expand_as (11 ms)
[----------] 6 tests from VulkanAPITest (95 ms total)
[----------] Global test environment tear-down
[==========] 6 tests from 1 test suite ran. (95 ms total)
[ PASSED ] 6 tests.
lfq@lfq-mbp fbsource %
```
Differential Revision: D46302042
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103930
Approved by: https://github.com/SS-JIA
Summary: Implement [aten::zeros](https://pytorch.org/docs/stable/generated/torch.zeros.html?highlight=zeros#torch.zeros)
Test Plan:
```
lfq@lfq-mbp fbsource % buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 -- --gtest_filter="*zeros*"
Action graph will be rebuilt because files have been added or removed.
Parsing buck files: finished in 2.3 sec
Downloaded 0/4 artifacts, 0.00 bytes, 100.0% cache miss (for updated rules)
Building: finished in 6.0 sec (100%) 454/454 jobs, 3/454 updated
Total time: 8.4 sec
BUILD SUCCEEDED
Running main() from xplat/third-party/gmock/googletest-1.12.1/googletest/src/gtest_main.cc
Note: Google Test filter = *zeros*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from VulkanAPITest
[ RUN ] VulkanAPITest.zeros
[ OK ] VulkanAPITest.zeros (99 ms)
[----------] 1 test from VulkanAPITest (99 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (99 ms total)
[ PASSED ] 1 test.
```
Differential Revision: D46777782
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103703
Approved by: https://github.com/SS-JIA
There is a `HAVE_TEST_SELECTION_TOOLS` conditional, but it turns out it does not really work, so fix it by defining all missing prototypes and making it work as a single-shard instance.
Add a lint rule to test that it would succeed for running only test_cuda with a released version of PyTorch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104111
Approved by: https://github.com/clee2000, https://github.com/ZainRizvi
This PR enables `-Winconsistent-missing-destructor-override` and `-Winconsistent-missing-override`
and fixes violations.
### <samp>🤖 Generated by Copilot at 47e904e</samp>
This pull request updates the code of various classes and operators in the `caffe2` and `aten` subdirectories to use the `override` specifier instead of the `virtual` keyword for destructors and other virtual functions that override a base class function. This improves the code readability, quality, and consistency with C++ best practices. It also modifies the `./CMakeLists.txt` file to enable warnings for these specifiers, but disable errors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104032
Approved by: https://github.com/malfet
Hi! We've been fuzzing torchvision project with [sydr-fuzz](https://github.com/ispras/oss-sydr-fuzz).
We've found a heap buffer overflow error at `source_range_serialization.cpp:73` in pytorch project.
The error occurs because there is no check in `deserialize_source` that `fnameIndex` is within the bounds of `text_table_`. To prevent the error, the corresponding bounds check must be added.
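A minimal, self-contained sketch of the missing bounds check (hypothetical function name; the actual fix lives in `SourceRangeDeserializer::deserialize_source`):
```cpp
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// Sketch of the missing bounds check: text_table_ is a vector of
// shared_ptr<std::string> (see the allocation stack below), and fnameIndex
// comes from untrusted deserialized data, so it must be validated before use.
std::shared_ptr<std::string> lookup_source_name(
    const std::vector<std::shared_ptr<std::string>>& text_table,
    std::size_t fnameIndex) {
  if (fnameIndex >= text_table.size()) {
    throw std::out_of_range(
        "source name index " + std::to_string(fnameIndex) +
        " is out of range for text table of size " +
        std::to_string(text_table.size()));
  }
  return text_table[fnameIndex];
}
```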
torchvision version: 9d0a93eee90bf7c401b74ebf9c8be80346254f15
pytorch version: 0f1621df1a0a73956c7ce4e2f72f069e610e0137
OS: Ubuntu 20.04
How to reproduce
1. Build docker from [here](https://github.com/ispras/oss-sydr-fuzz/tree/master/projects/torchvision) and run the container:
sudo docker build -t oss-sydr-fuzz-torchvision .
sudo docker run --privileged --rm -v `pwd`:/fuzz -it oss-sydr-fuzz-torchvision /bin/bash
2. Run the target on this input: [serialization-crash.txt](https://github.com/pytorch/pytorch/files/11819901/serialization-crash.txt)
/encode_png_fuzz serialization-crash.txt
3. You will see the following output:
=================================================================
==13==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x60200055a630 at pc 0x0000010197b7 bp 0x7ffd4cfb15f0 sp 0x7ffd4cfb15e8
READ of size 8 at 0x60200055a630 thread T0
#0 0x10197b6 in std::__shared_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, (__gnu_cxx::_Lock_policy)2>::get() const /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/shared_ptr_base.h:1325:16
#1 0x10197b6 in std::__shared_ptr_access<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, (__gnu_cxx::_Lock_policy)2, false, false>::_M_get() const /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/shared_ptr_base.h:1024:66
#2 0x10197b6 in std::__shared_ptr_access<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, (__gnu_cxx::_Lock_policy)2, false, false>::operator*() const /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/shared_ptr_base.h:1011:10
#3 0xde888c2 in torch::jit::SourceRangeDeserializer::deserialize_source(c10::IValue const&) /pytorch/torch/csrc/jit/serialization/source_range_serialization.cpp:73:16
#4 0xde8802b in torch::jit::SourceRangeDeserializer::deserialize(c10::IValue const&) /pytorch/torch/csrc/jit/serialization/source_range_serialization.cpp:51:37
#5 0xde8e9c7 in torch::jit::ConcreteSourceRangeUnpickler::unpickle() /pytorch/torch/csrc/jit/serialization/source_range_serialization.cpp:224:39
#6 0xde8fb19 in torch::jit::ConcreteSourceRangeUnpickler::findSourceRangeThatGenerated(torch::jit::SourceRange const&) /pytorch/torch/csrc/jit/serialization/source_range_serialization.cpp:231:3
#7 0x10798e7 in torch::jit::Source::findSourceRangeThatGenerated(torch::jit::SourceRange const&) /pytorch/torch/csrc/jit/frontend/source_range.cpp:144:23
#8 0x1079d9a in torch::jit::SourceRange::findSourceRangeThatGenerated() const /pytorch/torch/csrc/jit/frontend/source_range.h:384:26
#9 0x1079acd in torch::jit::SourceRange::highlight(std::ostream&) const /pytorch/torch/csrc/jit/frontend/source_range.cpp:149:32
#10 0x1026fe2 in torch::jit::Lexer::expected(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, torch::jit::Token const&) /pytorch/torch/csrc/jit/frontend/lexer.h:461:13
#11 0x10417d9 in torch::jit::Lexer::expected(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) /pytorch/torch/csrc/jit/frontend/lexer.h:465:5
#12 0x102e52c in torch::jit::Lexer::expect(int) /pytorch/torch/csrc/jit/frontend/lexer.h:471:7
#13 0xcee774c in torch::jit::ParserImpl::parseIdent() /pytorch/torch/csrc/jit/frontend/parser.cpp:52:16
#14 0xcef4ea8 in torch::jit::ParserImpl::parseBaseExp() /pytorch/torch/csrc/jit/frontend/parser.cpp:195:22
#15 0xcef2c1b in torch::jit::ParserImpl::parseExp(int) /pytorch/torch/csrc/jit/frontend/parser.cpp:284:16
#16 0xcefac6a in torch::jit::ParserImpl::parseExp() /pytorch/torch/csrc/jit/frontend/parser.cpp:262:12
#17 0xcefac6a in torch::jit::ParserImpl::parseSubscriptExp() /pytorch/torch/csrc/jit/frontend/parser.cpp:403:15
#18 0xceff39f in torch::jit::List<torch::jit::Expr> torch::jit::ParserImpl::parseList<torch::jit::Expr>(int, int, int, torch::jit::Expr (torch::jit::ParserImpl::*)())::'lambda'()::operator()() const /pytorch/torch/csrc/jit/frontend/parser.cpp:354:54
#19 0xceff39f in torch::jit::Expr std::__invoke_impl<void, torch::jit::List<torch::jit::Expr> torch::jit::ParserImpl::parseList<torch::jit::Expr>(int, int, int, torch::jit::Expr (torch::jit::ParserImpl::*)())::'lambda'()&>(std::__invoke_other, torch::jit::List<torch::jit::Expr> torch::jit::ParserImpl::parseList<torch::jit::Expr>(int, int, int, torch::jit::Expr (torch::jit::ParserImpl::*)())::'lambda'()&) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/invoke.h:60:14
#20 0xceea935 in torch::jit::ParserImpl::parseSequence(int, int, int, std::function<void ()> const&) /pytorch/torch/csrc/jit/frontend/parser.cpp:339:7
#21 0xceefd69 in torch::jit::List<torch::jit::Expr> torch::jit::ParserImpl::parseList<torch::jit::Expr>(int, int, int, torch::jit::Expr (torch::jit::ParserImpl::*)()) /pytorch/torch/csrc/jit/frontend/parser.cpp:353:5
#22 0xcef895a in torch::jit::ParserImpl::parseSubscript(c10::intrusive_ptr<torch::jit::Tree, c10::detail::intrusive_target_default_null_type<torch::jit::Tree> > const&) /pytorch/torch/csrc/jit/frontend/parser.cpp:430:9
#23 0xcef5e5c in torch::jit::ParserImpl::parseBaseExp() /pytorch/torch/csrc/jit/frontend/parser.cpp:206:18
#24 0xcef2c1b in torch::jit::ParserImpl::parseExp(int) /pytorch/torch/csrc/jit/frontend/parser.cpp:284:16
#25 0xceeeb9d in torch::jit::ParserImpl::parseExp() /pytorch/torch/csrc/jit/frontend/parser.cpp:262:12
#26 0xceeeb9d in torch::jit::ParserImpl::parseExpOrExpTuple() /pytorch/torch/csrc/jit/frontend/parser.cpp:94:19
#27 0xcee8a36 in torch::jit::ParserImpl::parseStmt(bool) /pytorch/torch/csrc/jit/frontend/parser.cpp:612:20
#28 0xcee7e72 in torch::jit::ParserImpl::parseStatements(bool, bool) /pytorch/torch/csrc/jit/frontend/parser.cpp:697:23
#29 0xcee56f5 in torch::jit::ParserImpl::parseClass() /pytorch/torch/csrc/jit/frontend/parser.cpp:747:9
#30 0xcee544a in torch::jit::Parser::parseClass() /pytorch/torch/csrc/jit/frontend/parser.cpp:812:17
#31 0xdddbea9 in torch::jit::SourceImporterImpl::parseSourceIfNeeded(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) /pytorch/torch/csrc/jit/serialization/import_source.cpp:182:42
#32 0xdddadbc in torch::jit::SourceImporterImpl::findNamedType(c10::QualifiedName const&) /pytorch/torch/csrc/jit/serialization/import_source.cpp:135:3
#33 0xdde1d88 in torch::jit::SourceImporterImpl::resolveType(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, torch::jit::SourceRange const&) /pytorch/torch/csrc/jit/serialization/import_source.cpp:261:10
#34 0xcf2ba5f in torch::jit::ScriptTypeParser::parseTypeFromExpr(torch::jit::Expr const&) const /pytorch/torch/csrc/jit/frontend/script_type_parser.cpp:238:24
#35 0xcf2bec7 in torch::jit::ScriptTypeParser::parseType(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) /pytorch/torch/csrc/jit/frontend/script_type_parser.cpp:312:10
#36 0xddf4284 in torch::jit::SourceImporter::loadType(c10::QualifiedName const&) const /pytorch/torch/csrc/jit/serialization/import_source.cpp:786:27
#37 0xdd739f7 in torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::$_0::operator()(c10::QualifiedName const&) const /pytorch/torch/csrc/jit/serialization/import.cpp:146:33
#38 0xdd739f7 in c10::StrongTypePtr std::__invoke_impl<c10::StrongTypePtr, torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::$_0&, c10::QualifiedName const&>(std::__invoke_other, torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::$_0&, c10::QualifiedName const&) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/invoke.h:60:14
#39 0xdd73880 in std::enable_if<is_invocable_r_v<c10::StrongTypePtr, torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::$_0&, c10::QualifiedName const&>, c10::StrongTypePtr>::type std::__invoke_r<c10::StrongTypePtr, torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::$_0&, c10::QualifiedName const&>(torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::$_0&, c10::QualifiedName const&) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/invoke.h:113:9
#40 0xdd736d6 in std::_Function_handler<c10::StrongTypePtr (c10::QualifiedName const&), torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::$_0>::_M_invoke(std::_Any_data const&, c10::QualifiedName const&) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/std_function.h:291:9
#41 0xdd76349 in std::function<c10::StrongTypePtr (c10::QualifiedName const&)>::operator()(c10::QualifiedName const&) const /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/std_function.h:622:14
#42 0xdeb9f48 in torch::jit::Unpickler::readGlobal(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) /pytorch/torch/csrc/jit/serialization/unpickler.cpp:835:9
#43 0xdeb012d in torch::jit::Unpickler::readInstruction() /pytorch/torch/csrc/jit/serialization/unpickler.cpp:511:7
#44 0xdeae437 in torch::jit::Unpickler::run() /pytorch/torch/csrc/jit/serialization/unpickler.cpp:251:27
#45 0xdeae0d2 in torch::jit::Unpickler::parse_ivalue() /pytorch/torch/csrc/jit/serialization/unpickler.cpp:204:3
#46 0xddd6de3 in torch::jit::readArchiveAndTensors(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<std::function<c10::StrongTypePtr (c10::QualifiedName const&)> >, c10::optional<std::function<c10::intrusive_ptr<c10::ivalue::Object, c10::detail::intrusive_target_default_null_type<c10::ivalue::Object> > (c10::StrongTypePtr, c10::IValue)> >, c10::optional<c10::Device>, caffe2::serialize::PyTorchStreamReader&, c10::Type::SingletonOrSharedTypePtr<c10::Type> (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&), std::shared_ptr<torch::jit::DeserializationStorageContext>) /pytorch/torch/csrc/jit/serialization/import_read.cpp:53:20
#47 0xdd732dd in torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) /pytorch/torch/csrc/jit/serialization/import.cpp:184:10
#48 0xdd69885 in torch::jit::(anonymous namespace)::ScriptModuleDeserializer::deserialize(c10::optional<c10::Device>, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&, bool) /pytorch/torch/csrc/jit/serialization/import.cpp:287:19
#49 0xdd6c855 in torch::jit::import_ir_module(std::shared_ptr<torch::jit::CompilationUnit>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<c10::Device>, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&, bool, bool) /pytorch/torch/csrc/jit/serialization/import.cpp:438:25
#50 0xdd6c1c7 in torch::jit::import_ir_module(std::shared_ptr<torch::jit::CompilationUnit>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<c10::Device>, bool) /pytorch/torch/csrc/jit/serialization/import.cpp:421:10
#51 0xdd6dce4 in torch::jit::load(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<c10::Device>, bool) /pytorch/torch/csrc/jit/serialization/import.cpp:503:10
#52 0xf2d3f75 in torch::serialize::InputArchive::load_from(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<c10::Device>) /pytorch/torch/csrc/api/src/serialize/input-archive.cpp:97:13
#53 0x60509c in void torch::load<at::Tensor, char*&>(at::Tensor&, char*&) /pytorch/torch/include/torch/csrc/api/include/torch/serialize.h:107:11
#54 0x6036be in LLVMFuzzerTestOneInput /vision/encode_png.cc:38:5
#55 0x66b041 in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:611:15
#56 0x6544cc in fuzzer::RunOneTest(fuzzer::Fuzzer*, char const*, unsigned long) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:324:6
#57 0x65a61b in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:860:9
#58 0x654222 in main /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerMain.cpp:20:10
#59 0x7f3d12cc7082 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24082) (BuildId: 1878e6b475720c7c51969e69ab2d276fae6d1dee)
#60 0x542cdd in _start (/encode_png_fuzz+0x542cdd)
0x60200055a630 is located 16 bytes to the right of 16-byte region [0x60200055a610,0x60200055a620)
allocated by thread T0 here:
#0 0x60057d in operator new(unsigned long) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/asan/asan_new_delete.cpp:95:3
#1 0xde9185d in std::_Vector_base<std::shared_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::shared_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >::_M_allocate(unsigned long) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/stl_vector.h:346:20
#2 0xde9185d in void std::vector<std::shared_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::shared_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >::_M_realloc_insert<std::shared_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >(__gnu_cxx::__normal_iterator<std::shared_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >*, std::vector<std::shared_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::shared_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > >, std::shared_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >&&) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/vector.tcc:440:33
#3 0xde916a1 in std::shared_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >& std::vector<std::shared_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::shared_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >::emplace_back<std::shared_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >(std::shared_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >&&) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/vector.tcc:121:4
#4 0xde8f445 in torch::jit::SourceRangeDeserializer::SourceRangeDeserializer(c10::IValue) /pytorch/torch/csrc/jit/serialization/source_range_serialization.h:42:19
#5 0xde8e141 in torch::jit::ConcreteSourceRangeUnpickler::unpickle() /pytorch/torch/csrc/jit/serialization/source_range_serialization.cpp:215:28
#6 0xde8fb19 in torch::jit::ConcreteSourceRangeUnpickler::findSourceRangeThatGenerated(torch::jit::SourceRange const&) /pytorch/torch/csrc/jit/serialization/source_range_serialization.cpp:231:3
#7 0x10798e7 in torch::jit::Source::findSourceRangeThatGenerated(torch::jit::SourceRange const&) /pytorch/torch/csrc/jit/frontend/source_range.cpp:144:23
#8 0x1079d9a in torch::jit::SourceRange::findSourceRangeThatGenerated() const /pytorch/torch/csrc/jit/frontend/source_range.h:384:26
#9 0x1079acd in torch::jit::SourceRange::highlight(std::ostream&) const /pytorch/torch/csrc/jit/frontend/source_range.cpp:149:32
#10 0x1026fe2 in torch::jit::Lexer::expected(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, torch::jit::Token const&) /pytorch/torch/csrc/jit/frontend/lexer.h:461:13
#11 0x10417d9 in torch::jit::Lexer::expected(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) /pytorch/torch/csrc/jit/frontend/lexer.h:465:5
#12 0xcee774c in torch::jit::ParserImpl::parseIdent() /pytorch/torch/csrc/jit/frontend/parser.cpp:52:16
#13 0xcef4ea8 in torch::jit::ParserImpl::parseBaseExp() /pytorch/torch/csrc/jit/frontend/parser.cpp:195:22
#14 0xcef2c1b in torch::jit::ParserImpl::parseExp(int) /pytorch/torch/csrc/jit/frontend/parser.cpp:284:16
#15 0xcefac6a in torch::jit::ParserImpl::parseExp() /pytorch/torch/csrc/jit/frontend/parser.cpp:262:12
#16 0xcefac6a in torch::jit::ParserImpl::parseSubscriptExp() /pytorch/torch/csrc/jit/frontend/parser.cpp:403:15
#17 0xceff39f in torch::jit::List<torch::jit::Expr> torch::jit::ParserImpl::parseList<torch::jit::Expr>(int, int, int, torch::jit::Expr (torch::jit::ParserImpl::*)())::'lambda'()::operator()() const /pytorch/torch/csrc/jit/frontend/parser.cpp:354:54
#18 0xceff39f in torch::jit::Expr std::__invoke_impl<void, torch::jit::List<torch::jit::Expr> torch::jit::ParserImpl::parseList<torch::jit::Expr>(int, int, int, torch::jit::Expr (torch::jit::ParserImpl::*)())::'lambda'()&>(std::__invoke_other, torch::jit::List<torch::jit::Expr> torch::jit::ParserImpl::parseList<torch::jit::Expr>(int, int, int, torch::jit::Expr (torch::jit::ParserImpl::*)())::'lambda'()&) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/invoke.h:60:14
#19 0xceea935 in torch::jit::ParserImpl::parseSequence(int, int, int, std::function<void ()> const&) /pytorch/torch/csrc/jit/frontend/parser.cpp:339:7
#20 0xceefd69 in torch::jit::List<torch::jit::Expr> torch::jit::ParserImpl::parseList<torch::jit::Expr>(int, int, int, torch::jit::Expr (torch::jit::ParserImpl::*)()) /pytorch/torch/csrc/jit/frontend/parser.cpp:353:5
#21 0xcef895a in torch::jit::ParserImpl::parseSubscript(c10::intrusive_ptr<torch::jit::Tree, c10::detail::intrusive_target_default_null_type<torch::jit::Tree> > const&) /pytorch/torch/csrc/jit/frontend/parser.cpp:430:9
#22 0xcef5e5c in torch::jit::ParserImpl::parseBaseExp() /pytorch/torch/csrc/jit/frontend/parser.cpp:206:18
#23 0xcef2c1b in torch::jit::ParserImpl::parseExp(int) /pytorch/torch/csrc/jit/frontend/parser.cpp:284:16
#24 0xceeeb9d in torch::jit::ParserImpl::parseExp() /pytorch/torch/csrc/jit/frontend/parser.cpp:262:12
#25 0xceeeb9d in torch::jit::ParserImpl::parseExpOrExpTuple() /pytorch/torch/csrc/jit/frontend/parser.cpp:94:19
#26 0xcee8a36 in torch::jit::ParserImpl::parseStmt(bool) /pytorch/torch/csrc/jit/frontend/parser.cpp:612:20
#27 0xcee7e72 in torch::jit::ParserImpl::parseStatements(bool, bool) /pytorch/torch/csrc/jit/frontend/parser.cpp:697:23
#28 0xcee56f5 in torch::jit::ParserImpl::parseClass() /pytorch/torch/csrc/jit/frontend/parser.cpp:747:9
#29 0xcee544a in torch::jit::Parser::parseClass() /pytorch/torch/csrc/jit/frontend/parser.cpp:812:17
#30 0xdddbea9 in torch::jit::SourceImporterImpl::parseSourceIfNeeded(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) /pytorch/torch/csrc/jit/serialization/import_source.cpp:182:42
#31 0xdddadbc in torch::jit::SourceImporterImpl::findNamedType(c10::QualifiedName const&) /pytorch/torch/csrc/jit/serialization/import_source.cpp:135:3
#32 0xdde1d88 in torch::jit::SourceImporterImpl::resolveType(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, torch::jit::SourceRange const&) /pytorch/torch/csrc/jit/serialization/import_source.cpp:261:10
#33 0xcf2ba5f in torch::jit::ScriptTypeParser::parseTypeFromExpr(torch::jit::Expr const&) const /pytorch/torch/csrc/jit/frontend/script_type_parser.cpp:238:24
SUMMARY: AddressSanitizer: heap-buffer-overflow /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/shared_ptr_base.h:1325:16 in std::__shared_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, (__gnu_cxx::_Lock_policy)2>::get() const
Shadow bytes around the buggy address:
0x0c04800a3470: fa fa 00 00 fa fa 00 00 fa fa fd fa fa fa 00 00
0x0c04800a3480: fa fa fd fa fa fa fd fd fa fa fd fd fa fa fd fa
0x0c04800a3490: fa fa fd fd fa fa 00 00 fa fa 00 00 fa fa 00 00
0x0c04800a34a0: fa fa fd fa fa fa fd fd fa fa fd fa fa fa 00 fa
0x0c04800a34b0: fa fa fd fd fa fa fd fd fa fa fd fa fa fa fd fd
=>0x0c04800a34c0: fa fa 00 00 fa fa[fa]fa fa fa fa fa fa fa fa fa
0x0c04800a34d0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c04800a34e0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c04800a34f0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c04800a3500: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c04800a3510: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone: fa
Freed heap region: fd
Stack left redzone: f1
Stack mid redzone: f2
Stack right redzone: f3
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Container overflow: fc
Array cookie: ac
Intra object redzone: bb
ASan internal: fe
Left alloca redzone: ca
Right alloca redzone: cb
==13==ABORTING
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103969
Approved by: https://github.com/davidberard98
The idea here is to do a graph mutation to:
* Create an initial dependency token at the beginning of the program.
* Replace non-functional version of assertion statements to functional version.
* The functional version of assertion statement will:
* Accept a dependency token from output of previous functional assertion statement (or the initial dependency token if there isn't any).
* Generate a dependency token as the output of assertion statement.
* Augment the output to include the dependency token generated by last assertion statement.
The goal here is to:
* Form an explicit dependency chain and avoid potential reordering during other passes of compiling.
* Make the assertions part of the overall execution graph so that they affect the final output (otherwise they could potentially be DCE'd).
**NOTE:**
* Currently this only covers `constrain_range`; support for other assertions is WIP. Sending out this PR to collect feedback first.
* This focuses only on the implementation itself. Integration with the current export flow will come in a future PR.
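Below is a minimal, framework-free Python sketch of the dependency-token idea described above; the names `initial_token`, `functional_assert`, and `program` are illustrative placeholders, not the actual operators or passes introduced by this PR.
```python
# Minimal sketch of the dependency-token chain; names are illustrative only.
def initial_token():
    # In the real graph this is a node created at the start of the program.
    return object()

def functional_assert(dep_token, cond, msg):
    # Functional form: consumes a token, performs the check, emits a new token.
    assert cond, msg
    return object()

def program(x):
    token = initial_token()
    token = functional_assert(token, x > 0, "x must be positive")
    y = x * 2
    token = functional_assert(token, y < 100, "y out of range")
    # The token is returned alongside the real output, so the assertion chain
    # cannot be dead-code-eliminated or reordered past uses of the output.
    return y, token
```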
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103757
Approved by: https://github.com/avikchaudhuri
Because we always run tests with pytest now.
Marking it as `bc-breaking` as there could technically be some scripts depending on it somewhere...
### <samp>🤖 Generated by Copilot at 1760568</samp>
> _`pytest` option gone_
> _simpler test runner script_
> _autumn leaves fall fast_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104125
Approved by: https://github.com/seemethere
- Extend support:
- quantized::conv1d
- quantized::conv3d
- quantized::conv3d_relu
- quantized::conv_transpose1d
- quantized::conv_transpose2d
- quantized::conv_transpose3d
- Note: quantized::{conv1d_relu,conv2d,conv2d_relu} already supported.
- To support this, quantization unpacking added for:
- conv1d
- conv_transpose1d
- conv_transpose2d
- conv_transpose3d
- Note: conv3d/conv3d_relu already had weights unpacking set up, even though it didn't have torch.onnx support.
- Add tests.
- The 3D tests will fail if run with the qnnpack backend (e.g., on Apple silicon Mac), so added decorator skipIfQuantizationBackendQNNPack.
- Minor fix in `aten/src/ATen/native/quantized/cpu/qconv.cpp` for 3D convolutions (triggered by added tests).
Fixes #102747
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102759
Approved by: https://github.com/BowenBao, https://github.com/thiagocrepaldi, https://github.com/kit1980
Summary: separate system information construction into its own static method, and update local caching (/temp_dir/cache is now a dir, not a file; this is relevant for upcoming changes, e.g. adding `allow_tf32`, since it would now be possible to have multiple valid local caches)
Test Plan: sandcastle + CI
Differential Revision: D46568207
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104050
Approved by: https://github.com/jansel
(These two fixes are now outdated; see EDIT below.)
Fixes for two thread safety issues (one currently unobserved, and one currently observed).
1. `std::erase` can potentially invalidate a pointer to an `ExecutionPlan` in the current implementation. While failures due to this issue have not yet been reported to my knowledge, it is better to return a copy of an `ExecutionPlan` for safety.
2. #103793 surfaced that `cudnnBackendExecute` appears to currently be thread-unsafe. I've verified this with a PyTorch-free (pure C++) repro using the cuDNN frontend. This PR adds a mutex that we can hopefully remove once this issue is resolved.
EDIT:
Feedback from cuDNN is that the V8 backend API has known thread-safety limitations when `ExecutionPlan`s are shared (or even shallow copied) across threads. Given that the common PyTorch use case of eager mode is single-threaded (per GPU), this PR now opts to make the `ExecutionPlan` caches `thread_local`, as this simplifies the code and eliminates the need for a mutex. The potential tradeoff is some additional warmup cost in the multithreaded case, but this would also only be worse than the current behavior if multiple threads had largely overlapping workloads.
CC @tuero @ptrblck @malfet
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103939
Approved by: https://github.com/xw285cornell, https://github.com/colesbury
The root cause of the crash in training nanoGPT was a null pointer dereference in the layer norm kernel.
While addressing the issue, I also made sure that `__syncthreads()` is simultaneously called by all threads in the block, to avoid unwanted side effects.
Moreover, I changed the kernel launch code to be clearer about the accumulation data type (`T_ACC`) and thread block dimensions, without changing behavior.
Fixes#95808
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95810
Approved by: https://github.com/ngimel
Summary: https://github.com/pytorch/pytorch/issues/100654 noticed that prelu
was not running its observers when the quantization flow was being run.
This was a bug which is now fixed, and the relevant prelu tests now
check for this. Also added a corrected observer for PReLU to
qconfig_mapping.
Test Plan: python test/test_quantization.py TestStaticQuantizedModule.test_prelu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103455
Approved by: https://github.com/jerryzh168
Summary:
Prepare QAT numerics for mobilenetv2 now match FX. There were
two changes needed to achieve this, however.
First, this commit adds observer sharing for ReLU6, which is
used extensively throughout this model. Second, in the tests we
have to use the same manual seed every time we call the models
in order to get the same results between FX and PT2. This is
because there is a dropout at the end of the model.
Test Plan: python test/test_quantization.py TestQuantizePT2EModels.test_qat_mobilenet_v2
Reviewed By: kimishpatel
Differential Revision: D46707786
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104068
Approved by: https://github.com/jerryzh168
Summary: Currently we rely on root operators, but we also need to check et_kernel_metadata for the specialized kernels that are used.
Test Plan: contbuild & OSS CI
Reviewed By: Jack-Khuu
Differential Revision: D46882119
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104005
Approved by: https://github.com/Jack-Khuu
Reference cycles are freed by the cycle collector rather than being cleaned up
when the objects in the cycle first become unreachable. If a cycle points to a tensor,
the CUDA memory for that tensor will not be freed until garbage collection runs.
Accumulation of CUDA allocations can lead to out of memory errors (OOMs), as well as
non-deterministic allocation behavior which is harder to debug.
This visualizer installs a garbage collection hook to look for cycles containing
CUDA tensors and saves a visualization of the garbage:
```
from torch.cuda._cycleviz import warn_tensor_cycles
warn_tensor_cycles()
# do some work that results in a cycle getting garbage collected
# ...
> WARNING:root:Reference cycle includes a CUDA Tensor see visualization of cycle /tmp/tmpeideu9gl.html
```
Reland to make windows skip the test.
This reverts commit 7b3b6dd4262337c5289d64dd3e824b0614cf68e3.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104051
Approved by: https://github.com/aaronenyeshi, https://github.com/malfet
**Summary**
The previous UT was accidentally broken, since the output of the conv2d node had been annotated by mistake.
Re-enable these UTs for the following cases:
- A single `conv2d` node where we don't annotate the output node of `conv2d`. There should be no fake quant at conv2d's output.
- For the `conv2d-maxpool` pattern, `maxpool` should have fake quant inserted at its input and output nodes since we annotate these nodes.
**Test Plan**
```
python -m pytest test_quantize_pt2e.py -k test_wo_annotate_conv_output_quantizer
python -m pytest test_quantize_pt2e.py -k test_max_pool2d_quantizer
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101941
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
# Summary
1. Add `num_elements_per_warp` as an optional triton config. Currently it's only used in Pointwise max_auto_tune.
2. Added an entry for Pointwise max_auto_tune when len(size_hints)==1. This is from the results of `CoordescTuner` for the `max_pool2d_with_indices_backward` kernel.
3. Made channels-last `max_pool2d_with_indices_backward` use the torch inductor lowering by default when auto-tune is enabled.
(I tried to update `num_elements_per_warp` for all configs directly. However, it brings some perf regressions for "torchbench" and "dynamic" models, so in this PR we still use a guard.)
# Performance test results
Operator max_pool2d_with_indices_backward testing:
```
python3.9 benchmarks/dynamo/microbenchmarks/operatorbench.py --suite=timm --op=aten.max_pool2d_with_indices_backward.default --max-samples=5 --dtype=float16 --channels-last
Before this change:
Fallback
Inductor Speedups : [0.9997202597876758, 1.0001309108307304, 1.0002654421310901]
Default lowering:
Inductor Speedups : [0.9945062166479167, 1.0632119741391315, 1.3002933288577507]
TORCHINDUCTOR_MAX_AUTOTUNE_POINTWISE=0
Inductor Speedups : [0.9941159121217165, 1.0648002410311495, 1.2999986086755966]
TORCHINDUCTOR_MAX_AUTOTUNE_POINTWISE=1
Inductor Speedups : [0.9950528253874693, 1.0651245316963014, 1.3013674401534756]
TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1
Inductor Speedups : [1.4020247605755827, 1.5504232138088152, 1.8226088905229931]
After this change:
TORCHINDUCTOR_MAX_AUTOTUNE_POINTWISE=1
Inductor Speedups : [1.403303265792746, 1.548831582475635, 1.822278780085024]
```
Inductor perf nightly run in progress:
https://github.com/pytorch/pytorch/actions/runs/5329044981
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103702
Approved by: https://github.com/jansel, https://github.com/eellison
Fixes #42376
`torch.save` serializes bound methods inside the LR scheduler, resulting in a large serialized file.
Test cases include checking the file size, checking whether `anneal_func` is a bound method, and checking that the file is loaded correctly.
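As a hedged, self-contained sketch of why a bound method bloats the checkpoint (the class and attribute names here are hypothetical, not the actual LR scheduler internals): pickling a bound method pickles its `__self__`, which drags the whole owning object into the payload.
```python
# Hypothetical names; sketch of how a bound method inflates the pickled payload.
import io
import pickle

class Scheduler:
    def __init__(self):
        self.data = list(range(100_000))   # stand-in for large referenced state
        self.anneal_func = self._anneal    # bound method, keeps a ref to self

    def _anneal(self, t):
        return t * 0.5

    def state_dict(self, include_bound_method):
        if include_bound_method:
            # Buggy variant: the bound method's __self__ pulls in self.data.
            return {"anneal_func": self.anneal_func}
        # Fixed variant: store only what is needed to rebuild the method.
        return {"anneal_func_name": self.anneal_func.__name__}

def pickled_size(obj):
    buf = io.BytesIO()
    pickle.dump(obj, buf)
    return len(buf.getvalue())

sched = Scheduler()
print(pickled_size(sched.state_dict(include_bound_method=True)))   # large
print(pickled_size(sched.state_dict(include_bound_method=False)))  # small
```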
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102627
Approved by: https://github.com/albanD
Summary:
Since NCCL 2.12.10, NCCL supports sending/receiving 0 bytes: https://github.com/NVIDIA/nccl/issues/696. Therefore we don't have to skip.
One issue is that if a rank has 0 bytes to send and 0 bytes to recv, it would skip the send/recv completely and proceed to the next collective where it can send/recv something, which is confusing to the other ranks. Another solution is to add a barrier, but that's very expensive.
Test Plan: will add a unit test
Differential Revision: D46507785
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103140
Approved by: https://github.com/malfet, https://github.com/kwen2501
Summary:
We remove the ExportGraphModuleMixin. There are several implications of this change:
1. The graph_module of ExportedProgram, EdgeDialectProgram and ExecutorchProgram won't have the same signature as the original user function. Instead, we should directly call the *Program, which has the same calling convention.
2. All passes need to go through prog.transform(*passes). We need to make all passes return a PassResult (see the sketch below).
3. We also need to make sure the graph_module.meta is preserved after transform.
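A hedged sketch of the pass convention from point 2; the pass body and the `prog` usage are illustrative assumptions, and the exact `transform` entry point may differ by version.
```python
# Sketch of a pass conforming to the convention above: it returns a PassResult
# and is meant to be applied via prog.transform(...).
from torch.fx.passes.infra.pass_base import PassResult

def my_pass(graph_module):
    modified = False
    # ... inspect or rewrite graph_module.graph here, setting modified=True ...
    if modified:
        graph_module.recompile()
    return PassResult(graph_module, modified)

# Hypothetical usage, assuming `prog` is an exported program object:
# new_prog = prog.transform(my_pass)
```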
Test Plan: Test with CI.
Differential Revision: D46729844
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103786
Approved by: https://github.com/avikchaudhuri
1) Fix @parametrize not working when using @with_comms in DTensorTestBase; this is because args and kwargs are currently not being passed through by the @with_comms wrapper.
2) Use @parametrize in test_fsdp_dtensor_state_dict.py to make sure it is working correctly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104065
Approved by: https://github.com/fduwjj
I doubt there's much difference in performance, but this improves readability of
the generated code, e.g.
```python
tmp8 = triton_helpers.max2(tmp7, 1)[:, None]
```
becomes
```python
tmp8 = triton_helpers.any(tmp7, 1)[:, None]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103974
Approved by: https://github.com/lezcano
Summary:
Special qspecs like `SharedQuantizationSpec` and
`DerivedQuantizationSpec` refer to other nodes in the graph.
However, after subgraph rewriting in QAT, the nodes referred
to in these special qspecs may be replaced by new nodes.
This could lead to the following error when inserting
observers according to these qspecs:
```
AssertionError: please make sure only refer to edge or node
that has observer/fake_quant inserted: 'getitem' not in
dict_keys([(arg0, convolution_default_1), (mul_tensor, convolution_default_1), getitem_3])
```
This commit fixes this by keeping track of the nodes that
are replaced during subgraph rewriting in QAT, and using
this mapping to update the dangling references used in these
special qspecs.
Test Plan: python test/test_quantization.py TestQuantizePT2E.test_qat_update_shared_qspec
Reviewed By: jerryzh168
Differential Revision: D46606614
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103970
Approved by: https://github.com/jerryzh168
Query for the list of reenabled issues in the filter test config step: switch filter test config to query for all the PR info instead of just the labels (so token usage should stay the same), move code and tests related to parsing reenabled issues to the filter test config step, and remove old code to get the PR body and commit message. `REENABLED_ISSUES` should be a comma-separated list of issue numbers to be reenabled.
For testing: Fixes #103789
Check that 103789 shows up in list of ignored disabled issues
Sanity check that test-config labels still work
More testing via `python3 ".github/scripts/filter_test_configs.py" --workflow "pull" --job-name "linux-bionic-cuda12.1-py3.10-gcc9 / test (default, 4, 5, linux.4xlarge.nvidia.gpu)" --test-matrix "{ include: [
{ config: "default", shard: 1, num_shards: 1 },
]}
" --pr-number "" --tag "" --event-name "push" --schedule "" --branch ""`
and
`python3 ".github/scripts/filter_test_configs.py" --workflow "pull" --job-name "linux-bionic-cuda12.1-py3.10-gcc9 / test (default, 4, 5, linux.4xlarge.nvidia.gpu)" --test-matrix "{"include": [{"config": "default", "shard": 1, "num_shards": 5, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "default", "shard": 2, "num_shards": 5, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "default", "shard": 3, "num_shards": 5, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "default", "shard": 4, "num_shards": 5, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "default", "shard": 5, "num_shards": 5, "runner": "linux.g5.4xlarge.nvidia.gpu"}]}" --pr-number "103790" --tag "" --event-name "pull_request" --schedule "" --branch ""`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103790
Approved by: https://github.com/huydhn
Summary:
The test fails because a fixed port is used to initialize the process group. That does not work in stress tests when multiple instances of the test are run concurrently.
Pick a random port and retry a few times if it is unavailable.
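A hedged sketch of the "random port + retry" idea (the helper name and port range are illustrative, not the actual test utility):
```python
# Try a handful of random ports until one can be bound.
import random
import socket

def find_free_port(max_tries: int = 5) -> int:
    for _ in range(max_tries):
        port = random.randint(20000, 65000)
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind(("127.0.0.1", port))
                return port
            except OSError:
                continue  # port already in use; retry with another one
    raise RuntimeError("could not find a free port")
```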
Test Plan:
```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:layout_optim -- --exact 'caffe2/test/inductor:layout_optim - test_mutate_view (caffe2.test.inductor.test_layout_optim.TestLayoutOptim)' --run-disabled --jobs 18 --stress-runs 10 --record-results
```
Differential Revision: D46908114
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103984
Approved by: https://github.com/williamwen42
Hi!
I've been fuzzing different pytorch modules with [sydr-fuzz](https://github.com/ispras/oss-sydr-fuzz/tree/master/projects/pytorch), and found a heap-buffer-overflow error that occurs due to an incorrect loop condition in torch::jit::unpickler.cpp. This bug was found in several fuzzing targets: it can be triggered by the `torch::jit::load()` method when loading a .pt model and by the `torch::distributed::rpc::deserializeRequest()` method in the RPC module.
All found errors could be reproduced with provided docker: [Dockerfile](https://github.com/ispras/oss-sydr-fuzz/tree/master/projects/pytorch).
### PoC for deserializeRequest():
[crash-0722408578cd2f26593b5a01e26d2a078d3dc5f6.zip](https://github.com/pytorch/pytorch/files/11756694/crash-0722408578cd2f26593b5a01e26d2a078d3dc5f6.zip)
```
=================================================================
==29858==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x6020004ed808 at pc 0x000000680084 bp 0x7ffcbd8220d0 sp 0x7ffcbd8220c8
READ of size 4 at 0x6020004ed808 thread T0
#0 0x680083 in c10::IValue::IValue(c10::IValue const&) /pytorch/aten/src/ATen/core/ivalue.h:224:33
#1 0xdc4beb8 in std::pair<c10::impl::DictIterator<c10::IValue, c10::IValue, ska_ordered::detailv3::sherwood_v3_table<std::pair<c10::IValue, c10::IValue>, c10::IValue, c10::detail::DictKeyHash, ska_ordered::detailv3::KeyOrValueHasher<c10::IValue, std::pair<c10::IValue, c10::IValue>, c10::detail::DictKeyHash>, c10::detail::DictKeyEqualTo, ska_ordered::detailv3::KeyOrValueEquality<c10::IValue, std::pair<c10::IValue, c10::IValue>, c10::detail::DictKeyEqualTo>, std::allocator<std::pair<c10::IValue, c10::IValue> >, std::allocator<ska_ordered::detailv3::sherwood_v3_entry<std::pair<c10::IValue, c10::IValue> > > >::templated_iterator<std::pair<c10::IValue, c10::IValue> > >, bool> c10::Dict<c10::IValue, c10::IValue>::insert_or_assign<c10::IValue&, c10::IValue&>(c10::IValue&, c10::IValue&) const /pytorch/aten/src/ATen/core/Dict_inl.h:136:5
#2 0xea680a7 in torch::jit::Unpickler::readInstruction() /pytorch/torch/csrc/jit/serialization/unpickler.cpp:452:14
#3 0xea64e07 in torch::jit::Unpickler::run() /pytorch/torch/csrc/jit/serialization/unpickler.cpp:251:27
#4 0xea64a61 in torch::jit::Unpickler::parse_ivalue() /pytorch/torch/csrc/jit/serialization/unpickler.cpp:204:3
#5 0xe9b13ce in torch::jit::unpickle(std::function<unsigned long (char*, unsigned long)>, std::function<c10::StrongTypePtr (c10::QualifiedName const&)>, c10::ArrayRef<at::Tensor>, c10::Type::SingletonOrSharedTypePtr<c10::Type> (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)) /pytorch/torch/csrc/jit/serialization/pickle.cpp:126:20
#6 0xe9b178c in torch::jit::unpickle(char const*, unsigned long, std::function<c10::StrongTypePtr (c10::QualifiedName const&)>, c10::ArrayRef<at::Tensor>, c10::Type::SingletonOrSharedTypePtr<c10::Type> (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)) /pytorch/torch/csrc/jit/serialization/pickle.cpp:136:10
#7 0xfdc8aa1 in torch::distributed::rpc::(anonymous namespace)::toIValues(torch::distributed::rpc::Message const&, torch::distributed::rpc::MessageType) /pytorch/torch/csrc/distributed/rpc/rref_proto.cpp:23:16
#8 0xfdca3ca in torch::distributed::rpc::PythonRRefFetchCall::fromMessage(torch::distributed::rpc::Message const&) /pytorch/torch/csrc/distributed/rpc/rref_proto.cpp:105:17
#9 0xfe7f347 in torch::distributed::rpc::deserializeRequest(torch::distributed::rpc::Message const&) /pytorch/torch/csrc/distributed/rpc/utils.cpp:117:14
#10 0x5c5d13 in LLVMFuzzerTestOneInput /message_deserialize.cc:192:27
#11 0x5c2bfd in ExecuteFilesOnyByOne /AFLplusplus/utils/aflpp_driver/aflpp_driver.c:255:7
#12 0x5c2a08 in LLVMFuzzerRunDriver /AFLplusplus/utils/aflpp_driver/aflpp_driver.c
#13 0x5c25c8 in main /AFLplusplus/utils/aflpp_driver/aflpp_driver.c:300:10
#14 0x7feb90908082 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24082) (BuildId: 1878e6b475720c7c51969e69ab2d276fae6d1dee)
#15 0x50237d in _start (/message_deserialize_afl+0x50237d)
0x6020004ed808 is located 8 bytes to the right of 16-byte region [0x6020004ed7f0,0x6020004ed800)
allocated by thread T0 here:
#0 0x5bfc1d in operator new(unsigned long) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/asan/asan_new_delete.cpp:95:3
#1 0x32ad8d1 in std::_Vector_base<c10::IValue, std::allocator<c10::IValue> >::_M_allocate(unsigned long) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/stl_vector.h:346:20
#2 0x32ad8d1 in void std::vector<c10::IValue, std::allocator<c10::IValue> >::_M_realloc_insert<double>(__gnu_cxx::__normal_iterator<c10::IValue*, std::vector<c10::IValue, std::allocator<c10::IValue> > >, double&&) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/vector.tcc:440:33
SUMMARY: AddressSanitizer: heap-buffer-overflow /pytorch/aten/src/ATen/core/ivalue.h:224:33 in c10::IValue::IValue(c10::IValue const&)
Shadow bytes around the buggy address:
0x0c0480095ab0: fa fa fd fd fa fa fd fd fa fa fd fd fa fa 00 00
0x0c0480095ac0: fa fa 00 00 fa fa 00 00 fa fa 04 fa fa fa 04 fa
0x0c0480095ad0: fa fa 00 fa fa fa fd fa fa fa 04 fa fa fa 00 fa
0x0c0480095ae0: fa fa 00 fa fa fa fd fa fa fa fd fa fa fa fd fa
0x0c0480095af0: fa fa fd fd fa fa 00 00 fa fa 00 fa fa fa 00 00
=>0x0c0480095b00: fa[fa]fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c0480095b10: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c0480095b20: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c0480095b30: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c0480095b40: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c0480095b50: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone: fa
Freed heap region: fd
Stack left redzone: f1
Stack mid redzone: f2
Stack right redzone: f3
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Container overflow: fc
Array cookie: ac
Intra object redzone: bb
ASan internal: fe
Left alloca redzone: ca
Right alloca redzone: cb
==29858==ABORTING
```
### PoC for load():
[crash-2bd32e496811fb06de24a2bb720dc6490218009f.zip](/uploads/53d108cdd434ec4b11a2034bbca3cfd8/crash-2bd32e496811fb06de24a2bb720dc6490218009f.zip)
```
==29865==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x60c00031f388 at pc 0x000000669984 bp 0x7ffd6c6de630 sp 0x7ffd6c6de628
READ of size 4 at 0x60c00031f388 thread T0
#0 0x669983 in c10::IValue::IValue(c10::IValue const&) /pytorch/aten/src/ATen/core/ivalue.h:224:33
#1 0xdc3de68 in std::pair<c10::impl::DictIterator<c10::IValue, c10::IValue, ska_ordered::detailv3::sherwood_v3_table<std::pair<c10::IValue, c10::IValue>, c10::IValue, c10::detail::DictKeyHash, ska_ordered::detailv3::KeyOrValueHasher<c10::IValue, std::pair<c10::IValue, c10::IValue>, c10::detail::DictKeyHash>, c10::detail::DictKeyEqualTo, ska_ordered::detailv3::KeyOrValueEquality<c10::IValue, std::pair<c10::IValue, c10::IValue>, c10::detail::DictKeyEqualTo>, std::allocator<std::pair<c10::IValue, c10::IValue> >, std::allocator<ska_ordered::detailv3::sherwood_v3_entry<std::pair<c10::IValue, c10::IValue> > > >::templated_iterator<std::pair<c10::IValue, c10::IValue> > >, bool> c10::Dict<c10::IValue, c10::IValue>::insert_or_assign<c10::IValue&, c10::IValue&>(c10::IValue&, c10::IValue&) const /pytorch/aten/src/ATen/core/Dict_inl.h:136:5
#2 0xea5a207 in torch::jit::Unpickler::readInstruction() /pytorch/torch/csrc/jit/serialization/unpickler.cpp:452:14
#3 0xea56f67 in torch::jit::Unpickler::run() /pytorch/torch/csrc/jit/serialization/unpickler.cpp:251:27
#4 0xea56bc1 in torch::jit::Unpickler::parse_ivalue() /pytorch/torch/csrc/jit/serialization/unpickler.cpp:204:3
#5 0xe96db4e in torch::jit::readArchiveAndTensors(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<std::function<c10::StrongTypePtr (c10::QualifiedName const&)> >, c10::optional<std::function<c10::intrusive_ptr<c10::ivalue::Object, c10::detail::intrusive_target_default_null_type<c10::ivalue::Object> > (c10::StrongTypePtr, c10::IValue)> >, c10::optional<c10::Device>, caffe2::serialize::PyTorchStreamReader&, c10::Type::SingletonOrSharedTypePtr<c10::Type> (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&), std::shared_ptr<torch::jit::DeserializationStorageContext>) /pytorch/torch/csrc/jit/serialization/import_read.cpp:53:20
#6 0xe8fc648 in torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) /pytorch/torch/csrc/jit/serialization/import.cpp:184:10
#7 0xe8f8935 in torch::jit::(anonymous namespace)::ScriptModuleDeserializer::deserialize(c10::optional<c10::Device>, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&, bool) /pytorch/torch/csrc/jit/serialization/import.cpp:287:19
#8 0xe8f6d74 in torch::jit::import_ir_module(std::shared_ptr<torch::jit::CompilationUnit>, std::istream&, c10::optional<c10::Device>, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&, bool, bool) /pytorch/torch/csrc/jit/serialization/import.cpp:386:25
#9 0xe90086e in torch::jit::import_ir_module(std::shared_ptr<torch::jit::CompilationUnit>, std::istream&, c10::optional<c10::Device>, bool) /pytorch/torch/csrc/jit/serialization/import.cpp:322:10
#10 0xe903209 in torch::jit::load(std::istream&, c10::optional<c10::Device>, bool) /pytorch/torch/csrc/jit/serialization/import.cpp:482:10
#11 0x5c2d60 in LLVMFuzzerTestOneInput /load.cc:42:14
#12 0x5c2a8d in ExecuteFilesOnyByOne /AFLplusplus/utils/aflpp_driver/aflpp_driver.c:255:7
#13 0x5c2898 in LLVMFuzzerRunDriver /AFLplusplus/utils/aflpp_driver/aflpp_driver.c
#14 0x5c2458 in main /AFLplusplus/utils/aflpp_driver/aflpp_driver.c:300:10
#15 0x7f156ae33082 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24082) (BuildId: 1878e6b475720c7c51969e69ab2d276fae6d1dee)
#16 0x50220d in _start (/load_afl+0x50220d)
0x60c00031f388 is located 8 bytes to the right of 128-byte region [0x60c00031f300,0x60c00031f380)
allocated by thread T0 here:
#0 0x5bfaad in operator new(unsigned long) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/asan/asan_new_delete.cpp:95:3
#1 0xa86231 in std::_Vector_base<c10::IValue, std::allocator<c10::IValue> >::_M_allocate(unsigned long) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/stl_vector.h:346:20
#2 0xa86231 in void std::vector<c10::IValue, std::allocator<c10::IValue> >::_M_realloc_insert<c10::IValue&>(__gnu_cxx::__normal_iterator<c10::IValue*, std::vector<c10::IValue, std::allocator<c10::IValue> > >, c10::IValue&) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/vector.tcc:440:33
SUMMARY: AddressSanitizer: heap-buffer-overflow /pytorch/aten/src/ATen/core/ivalue.h:224:33 in c10::IValue::IValue(c10::IValue const&)
Shadow bytes around the buggy address:
0x0c188005be20: fd fd fd fd fd fd fd fd fa fa fa fa fa fa fa fa
0x0c188005be30: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
0x0c188005be40: fa fa fa fa fa fa fa fa fd fd fd fd fd fd fd fd
0x0c188005be50: fd fd fd fd fd fd fd fd fa fa fa fa fa fa fa fa
0x0c188005be60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x0c188005be70: fa[fa]fa fa fa fa fa fa 00 00 00 00 00 00 00 00
0x0c188005be80: 00 00 00 00 00 00 00 00 fa fa fa fa fa fa fa fa
0x0c188005be90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c188005bea0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c188005beb0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c188005bec0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone: fa
Freed heap region: fd
Stack left redzone: f1
Stack mid redzone: f2
Stack right redzone: f3
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Container overflow: fc
Array cookie: ac
Intra object redzone: bb
ASan internal: fe
Left alloca redzone: ca
Right alloca redzone: cb
==29865==ABORTING
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103667
Approved by: https://github.com/albanD
- Upstream LLVM switched LLJIT's default JIT linker for ELF/x86-64 to JITLink: [commit](b92839c954). This commit requires clients to use JITLink plugins, following the example in "llvm/examples/OrcV2Examples/LLJITWithCustomObjectLinkingLayer".
- The current change updates PytorchLLVMJITImpl to set an ObjectLinkingLayer on LLJIT creation.
- If setObjectLinkingLayerCreator is not set, an RTDyldObjectLinkingLayer will be constructed. This is currently causing a "Symbols not found: [ llvm_orc_registerEHFrameSectionWrapper ]" error for tests in test_quantization.py when pytorch is built against the latest LLVM.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103824
Approved by: https://github.com/jeffdaily, https://github.com/davidberard98
Summary:
This PR fixes the wrong assertion in `test_addmm_gelu` that fails in the Windows CUDA CI job, caused by #103811. The addmm + GELU fusion is likely not happening (or not using the tanh approximation) on Windows. See [this comment](https://github.com/pytorch/pytorch/pull/103811#issuecomment-1601936203) in #103811 for the details of the error.
Test Plan:
```
$ python test/test_linalg.py -k test_addmm_relu -v
test_addmm_relu_cpu_bfloat16 (__main__.TestLinalgCPU.test_addmm_relu_cpu_bfloat16) ... ok
test_addmm_relu_cpu_float32 (__main__.TestLinalgCPU.test_addmm_relu_cpu_float32) ... ok
test_addmm_relu_cpu_float64 (__main__.TestLinalgCPU.test_addmm_relu_cpu_float64) ... ok
test_addmm_relu_cuda_bfloat16 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_bfloat16) ... ok
test_addmm_relu_cuda_float32 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_float32) ... ok
test_addmm_relu_cuda_float64 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_float64) ... ok
----------------------------------------------------------------------
Ran 6 tests in 2.131s
OK
$ python test/test_linalg.py -k test_addmm_gelu -v
test_addmm_gelu_cpu_bfloat16 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_bfloat16) ... ok
test_addmm_gelu_cpu_float32 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_float32) ... ok
test_addmm_gelu_cpu_float64 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_float64) ... ok
test_addmm_gelu_cuda_bfloat16 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_bfloat16) ... ok
test_addmm_gelu_cuda_float32 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_float32) ... ok
test_addmm_gelu_cuda_float64 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_float64) ... ok
----------------------------------------------------------------------
Ran 6 tests in 2.194s
OK
```
Reviewers: @eellison @huydhn
Differential Revision: [D46931688](https://our.internmc.facebook.com/intern/diff/D46931688)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104031
Approved by: https://github.com/huydhn, https://github.com/malfet
As part of this, a new `AutocastIPU` dispatch key has been added.
There's an existing PR, #85043, to make `Autocast` a proper per-backend functionality key, but it ran into issues with layering with other functionality keys and went stale.
This has been tested in the out-of-tree IPU PyTorch backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103890
Approved by: https://github.com/albanD
The following subjects are not in this PR and will be done in a follow up:
- Go through torch_function section and update to the latest phrasing and link to the proper new sections
- Go through torch.library and custom device docs to add links to the new sections as appropriate
- Top level explanations on which component should be used
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102087
Approved by: https://github.com/janeyx99
default_partitioner is kind of broken when it comes to memory footprint. Moving aot_eager to use the min-cut partitioner gives a better debugging experience.
One downside is that we will see much lower speedup numbers, because the min-cut partitioner will try to recompute ops.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103555
Approved by: https://github.com/eellison, https://github.com/jansel
Summary: The old shader file was created before channel padding was implemented. We recompute the positions taking into account that channels are padded to a multiple of 4.
Test Plan:
under `fbsource` run `buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1`
full test result: P772641736
Reviewed By: SS-JIA
Differential Revision: D46866159
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103908
Approved by: https://github.com/SS-JIA
By including `Engine.h` in `Shim.cpp` and defining `bool available()` outside of `#ifdef` guard in `Common.h`.
Modernize code a bit by using nested namespaces.
Fixes following compilation error if `USE_XNNPACK` is false:
```
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/xnnpack/Shim.cpp:26:6: error: no previous prototype for function 'available' [-Werror,-Wmissing-prototypes]
bool available() {
^
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/xnnpack/Shim.cpp:30:6: error: no previous prototype for function 'use_convolution2d' [-Werror,-Wmissing-prototypes]
bool use_convolution2d(
^
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/xnnpack/Shim.cpp:42:8: error: no previous prototype for function 'convolution2d' [-Werror,-Wmissing-prototypes]
Tensor convolution2d(
^
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/xnnpack/Shim.cpp:53:6: error: no previous prototype for function 'use_linear' [-Werror,-Wmissing-prototypes]
bool use_linear(
^
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/xnnpack/Shim.cpp:60:8: error: no previous prototype for function 'linear' [-Werror,-Wmissing-prototypes]
Tensor linear(
^
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/xnnpack/Shim.cpp:67:6: error: no previous prototype for function 'use_max_pool2d' [-Werror,-Wmissing-prototypes]
bool use_max_pool2d(
^
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/xnnpack/Shim.cpp:79:8: error: no previous prototype for function 'max_pool2d' [-Werror,-Wmissing-prototypes]
Tensor max_pool2d(
^
```
### <samp>🤖 Generated by Copilot at f8ac185</samp>
> _The code for xnnpack activations_
> _Was scattered in different locations_
> _But now it's all neat_
> _In `Activation.cpp`_
> _With nested namespaces and simplifications_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104004
Approved by: https://github.com/drisspg
The checks are unnecessary as PSO derived from `metalIndexingPSO` function is already checked, see:
c4752b1a91/aten/src/ATen/mps/MPSDevice.mm (L69-L72)
### <samp>🤖 Generated by Copilot at 2d71d96</samp>
This pull request removes unnecessary and duplicated error handling code for the pipeline state object in the constructors of several MPS kernel classes in `aten/src/ATen/native/mps/operations`. This makes the code more concise and clear.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103244
Approved by: https://github.com/albanD
Reference cycles are freed by the cycle collector rather than being cleaned up
when the objects in the cycle first become unreachable. If a cycle points to a tensor,
the CUDA memory for that tensor will not be freed until garbage collection runs.
Accumulation of CUDA allocations can lead to out of memory errors (OOMs), as well as
non-deterministic allocation behavior which is harder to debug.
This visualizer installs a garbage collection hook to look for cycles containing
CUDA tensors and saves a visualization of the garbage:
```
from torch.cuda._cycleviz import warn_tensor_cycles
warn_tensor_cycles()
# do some work that results in a cycle getting garbage collected
# ...
> WARNING:root:Reference cycle includes a CUDA Tensor see visualization of cycle /tmp/tmpeideu9gl.html
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102656
Approved by: https://github.com/aaronenyeshi
This PR makes some improvements for debuggability of checkpointing:
- improved error messages that are more understandable
- errors are now `CheckpointError` which subclasses `RuntimeError` (only `CheckpointError` triggers debug message, see below)
- stricter error checking by default:
- shapes, dtypes, and device are compared
- we also now error when more tensors are being saved for backward during recompute
- NOTE: checks are relaxed if it is detected that you are doing backward within forward
- shapes, dtype, and device checking can be disabled by passing `determinism_check="none"`
- new debug flag: more helpful error message when `debug=True`
Note:
- cpp stack trace is only included for x86 linux machines
- the error message if cpp stack trace is included can be quite long. For a function checkpointed with 8 operators, the log was around 1300 lines! (should this be hidden behind a flag?)
[Error message when debug='True' (python stack trace only)](https://gist.github.com/soulitzer/3d5e19c7cceae8e22f9bdd625ec39dd4)
[Error message when debug='True' (with python and cpp stacktrace)](https://gist.github.com/soulitzer/ff8fd8c3ccbb2c90dfe3df6d7713b167)
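A minimal usage sketch of the knobs described above, assuming the non-reentrant checkpoint API; the toy function and tensor shapes are illustrative:
```python
import torch
from torch.utils.checkpoint import checkpoint, CheckpointError

def fn(x):
    return x.sin().cos()

x = torch.randn(8, requires_grad=True)

# Stricter metadata checks are on by default; determinism_check="none" relaxes
# them, and debug=True requests the more detailed error message.
try:
    out = checkpoint(fn, x, use_reentrant=False, debug=True)
    out.sum().backward()
except CheckpointError as e:
    # Raised when recomputed tensors don't match what was saved in forward.
    print("recompute mismatch:", e)
```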
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103859
Approved by: https://github.com/albanD
We wrote some new Contributing Guidelines that guide a contributor
through the lifecycle of a Pull Request to PyTorch.
We've gotten some positive feedback from early adopters so we are now
adding it as the go-to link in CONTRIBUTING.md and the PyTorch Wiki.
Note that there are older contributing guidelines over at
https://github.com/pytorch/pytorch/blob/main/docs/source/community/contribution_guide.rst
The new Contributing Guidelines doc is targeted towards guiding a user
through submitting and merging a Pull Request to pytorch; the existing
guidelines are more of a high-level overview. We should rationalize these
at some point, but I left the resources for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103986
Approved by: https://github.com/kit1980, https://github.com/albanD
Summary: We implement `aten::tile` on the Vulkan backend through `aten::repeat`. The behavior of `aten::tile` is documented at https://pytorch.org/docs/stable/generated/torch.tile.html
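A hedged sketch of the tile-to-repeat relationship this implementation relies on: `torch.tile` left-pads `dims` with 1s, after which it matches `Tensor.repeat`.
```python
import torch

x = torch.arange(6).reshape(2, 3)
dims = (2,)  # fewer entries than x.dim()
# Left-pad dims with 1s to match the tensor rank, then tile == repeat.
padded = (1,) * (x.dim() - len(dims)) + tuple(dims)
assert torch.equal(torch.tile(x, dims), x.repeat(*padded))
```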
Test Plan:
Run tests for combinations of input dim between 1 and 4 and repeats of size between 1 and 4. When a test case fails, the shape info is printed, e.g. `Tile test failed when input is of shape [13, 5] and repeat of [7, 2, 3]`.
```
(base) luwei@luwei-mbp fbsource % buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 -- --gtest_filter="*tile*"
Building: finished in 0.1 sec (100%) 263/2812 jobs, 0/2812 updated
Total time: 0.1 sec
BUILD SUCCEEDED
Running main() from xplat/third-party/gmock/googletest-1.12.1/googletest/src/gtest_main.cc
Note: Google Test filter = *tile*
[==========] Running 3 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 3 tests from VulkanAPITest
[ RUN ] VulkanAPITest.tile_invalid_inputs_exceptions
[ OK ] VulkanAPITest.tile_invalid_inputs_exceptions (34 ms)
[ RUN ] VulkanAPITest.tile_invalid_outpus_exceptions
[ OK ] VulkanAPITest.tile_invalid_outpus_exceptions (2 ms)
[ RUN ] VulkanAPITest.tile
[ OK ] VulkanAPITest.tile (63 ms)
[----------] 3 tests from VulkanAPITest (100 ms total)
[----------] Global test environment tear-down
[==========] 3 tests from 1 test suite ran. (100 ms total)
[ PASSED ] 3 tests.
```
Reviewed By: yipjustin
Differential Revision: D46367170
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103944
Approved by: https://github.com/SS-JIA
- Extend dynamo bench interface with '--compilers onnx' and '--compilers dynamo-onnx'
- ONNX bench exports model to onnx and runs in ONNX Runtime.
- Introduce error aggregation and report.
- Scripts to build ONNX deps and running ONNX bench.
- Huggingface accuracy check workaround for ONNX.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103135
Approved by: https://github.com/thiagocrepaldi, https://github.com/jansel
Fixes #102922, adding a more descriptive error message when dealing with inputs that contain mixed types.
Would be happy to add a test (I believe in test_nn.py?), just want to confirm that this is the correct place to put it!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103360
Approved by: https://github.com/albanD
Query for the list of reenabled issues in the filter test config step: switch filter test config to query for all the PR info instead of just the labels (so token usage should stay the same), move code and tests related to parsing reenabled issues to the filter test config step, and remove old code to get the PR body and commit message. `REENABLED_ISSUES` should be a comma-separated list of issue numbers to be reenabled.
For testing: Fixes #103789
Check that 103789 shows up in list of ignored disabled issues
Sanity check that test-config labels still work
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103790
Approved by: https://github.com/huydhn
Summary:
The argument is unsupported on other architectures, and Clang 17 will
error out when you pass an argument that's unsupported for the arch
you're building for. Note that we need to use platform_compiler_flags
instead of selects because the latter can't distinguish between
architectures when doing a multi-arch app build in Buck1.
Differential Revision: D46825070
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103929
Approved by: https://github.com/ezyang
The current thing indents based on the length of the previous line, which is totally unreadable if, e.g. the treespec is a dict with a lot of keys, since all the keys will go on a ginormous line and everything after will be super indented.
Fix the indentation at 2, which is much more compact.
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103945
Approved by: https://github.com/zou3519
Currently these are decomposed into `as_strided`, which forces a buffer to be
realized. Instead, this lowers them into a native inductor view node and so
doesn't require any buffers to be realized.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103755
Approved by: https://github.com/jansel
Summary:
Previously, addmm + GELU epilogue fusion was unconditionally disabled in `ATen/native/cuda/Blas.cpp` due to compilation and numerical issues in CUDA <= 11.4. This PR:
1. Enables addmm + GELU epilogue fusion for CUDA >= 11.8.
2. Restricts the usage of fused addmm epilogue to contiguous output (bugfix).
3. Extends unit tests with addmm epilogue fusion and GELU activation paths.
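For reference, a minimal sketch of the unfused pattern this epilogue fusion targets; shapes are illustrative, and the fused path is selected internally only on eligible CUDA builds.
```python
import torch

bias = torch.randn(16)
a = torch.randn(32, 64)
b = torch.randn(64, 16)

# addmm computes bias + a @ b; the epilogue fusion folds the GELU
# (tanh approximation) into the same call on eligible GPUs.
out = torch.nn.functional.gelu(torch.addmm(bias, a, b), approximate="tanh")
print(out.shape)  # torch.Size([32, 16])
```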
Test Plan:
$ python test/test_linalg.py -k test_addmm_relu -v
test_addmm_relu_cpu_bfloat16 (__main__.TestLinalgCPU.test_addmm_relu_cpu_bfloat16) ... ok
test_addmm_relu_cpu_float32 (__main__.TestLinalgCPU.test_addmm_relu_cpu_float32) ... ok
test_addmm_relu_cpu_float64 (__main__.TestLinalgCPU.test_addmm_relu_cpu_float64) ... ok
test_addmm_relu_cuda_bfloat16 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_bfloat16) ... ok
test_addmm_relu_cuda_float32 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_float32) ... ok
test_addmm_relu_cuda_float64 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_float64) ... ok
$ python test/test_linalg.py -k test_addmm_gelu -v
test_addmm_gelu_cpu_bfloat16 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_bfloat16) ... ok
test_addmm_gelu_cpu_float32 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_float32) ... ok
test_addmm_gelu_cpu_float64 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_float64) ... ok
test_addmm_gelu_cuda_bfloat16 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_bfloat16) ... ok
test_addmm_gelu_cuda_float32 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_float32) ... ok
test_addmm_gelu_cuda_float64 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_float64) ... ok
Reviewers: @eellison
Differential Revision: [D46829884](https://our.internmc.facebook.com/intern/diff/D46829884)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103811
Approved by: https://github.com/IvanYashchuk, https://github.com/eellison
Currently calling the fill.Tensor overload under `torch.compile` results in a
`DataDependentOutputException` due to the `.item()` call. This instead does a
device-device copy which can then be inlined into subsequent inductor kernels as
you would expect, e.g.
```python
def fn(a):
result = torch.deg2rad(a).sin()
return torch.empty((128, 128), device=a.device).fill_(result)
```
generates the single kernel
```python
@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
xnumel = 16384
xoffset = tl.program_id(0) * XBLOCK
xindex = xoffset + tl.arange(0, XBLOCK)[:]
xmask = xindex < xnumel
x0 = xindex
tmp0 = tl.load(in_ptr0 + (0))
tmp1 = tl.broadcast_to(tmp0, [XBLOCK])
tmp2 = 0.017453292519943295
tmp3 = tmp1 * tmp2
tmp4 = tl.sin(tmp3)
tl.store(out_ptr0 + (x0), tmp4, None)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103880
Approved by: https://github.com/Chillee
Summary:
`np.str` was deprecated in numpy 1.20.0 and has since been removed. It was an alias for the builtin `str`, so it's safe to do the replacement.
The whole change is mechanical, generated using the following one-liner:
```
fbgr -sl 'np\.str\b' | xargs perl -pi -e 's,\bnp\.str\b,str,g'
```
Test Plan: sandcastle
Differential Revision: D46586144
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103931
Approved by: https://github.com/huydhn
Summary:
Before this commit, only prepare QAT numerics matched
between PT2 and FX for resnet18. Convert numerics diverged,
however, for two reasons:
(1) Existing patterns did not handle inplace ReLUs. This commit
fixes this by adding extra patterns that use these ReLUs instead
of the normal ones.
(2) Subgraph rewriter could not handle skip connections in
quantized models, because the dequantize node is used in both
the conv node within the match pattern, and an inplace add node
outside of the match pattern. This led the subgraph matcher to
filter out the match, complaining that it was not self contained.
This commit fixes this problem by duplicating the dequantize
nodes, one for each user, such that subsequent matches will
be self contained.
Test Plan: python test/test_quantization.py TestQuantizePT2EModels.test_qat_resnet18
Reviewed By: jerryzh168
Differential Revision: D46564114
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103759
Approved by: https://github.com/jerryzh168
Fixes https://github.com/pytorch/pytorch/issues/103132
This is kind of annoying: Functionalization (and also vmap, I think?) manually figures out which ops have C++ CompositeImplicit decomps, and directly registers them to the Functionalize key. This is a problem for the PyDispatcher: We normally want the PyDispatcher to take precedence over the regular dispatcher. But in this case, we have a python decomp registered to `CompositeImplicitAutograd`, and a C++ decomp registered *directly* to the `Functionalize` key, so the C++ decomp gets precedence over the python decomp.
The way this showed up was that a model was running `matmul()` under inference mode, so we never hit the autograd dispatch key, and go straight to the functionalize dispatch key. Matmul has both a python decomp and a c++ decomp, but we were running the C++ decomp. That C++ decomp isn't meant to be used with dynamic shapes, so we were failing with the "tried to call `.sizes()` on a tensor with dynamic shapes" error.
For now, I had the PyDispatcher mimic the behavior of functionalization codegen: when you register a python decomp to the `CompositeImplicitAutograd` key, this PR just automatically registers that decomp to the `Functionalize` key at the same time.
I'm trying to remember now why we didn't just add `Functionalize` (and all of the other functorch transform keys) directly to the `CompositeImplicitAutograd` alias keyset, but I couldn't remember (@zou3519 any chance you remember?).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103275
Approved by: https://github.com/ezyang, https://github.com/zou3519
Fixes https://github.com/pytorch/pytorch/issues/103153
AOTAutograd has some logic for handling the case when we have:
* a graph output that is a view of an intermediate
* None of the other aliases of that output escape the graph, so from the perspective of the user + the autograd engine, we can pretend that the output is not a view
However, that logic would inject an `_unsafe_view()` call into the graph at trace time. This isn't wrong, but inductor will just immediately decompose `_unsafe_view()` into `view()`, and so the output tensor will continue to show up as having view metadata w.r.t. autograd.
This PR changes the `unsafe_view()` call to be in the runtime epilogue, instead of being part of the graph (where the compiler might do bad things to it - the compiler also shouldn't have to concern itself with autograd metadata).
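A toy sketch of the idea (not the actual AOTAutograd code): the view fix-up happens in a runtime epilogue around the compiled function rather than as a node in the traced graph:
```python
import torch

def run_with_view_epilogue(compiled_fn, args, output_view_shapes):
    """output_view_shapes[i] is the target shape for output i, or None."""
    outs = compiled_fn(*args)
    fixed = []
    for out, shape in zip(outs, output_view_shapes):
        if shape is not None:
            # Applied outside the graph, so the compiler never sees (or decomposes) it.
            out = torch.ops.aten._unsafe_view(out, shape)
        fixed.append(out)
    return fixed
```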
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103919
Approved by: https://github.com/ezyang
Added two signpost_event calls to torch.fx.experimental.symbolic_shapes, one for produce_guards (where we can give stats like how many free symbols and how many guards produced) and the other is for evaluate_expr after freeze (so we can look for cases where we're improperly discarding guards in backwards.)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103882
Approved by: https://github.com/Skylion007
This is a reland of https://github.com/pytorch/pytorch/pull/100007 with a build fix for Windows debug builds.
`at::native::ParamsHash` only works on structs with standard layout, but `std::string` isn't one in Visual C++ debug builds, which one can easily verify by running something like:
```cpp
#define _DEBUG
#include <type_traits>
#include <string>
static_assert(std::is_standard_layout_v<std::string>, "Oh noes");
```
If the above condition is not met, instead of printing the static_assert message, VC++ raises a very cryptic compilation error; see https://github.com/pytorch/pytorch/pull/100007#discussion_r1227116292 for more detail.
Also, using `std::hash` for string should result in a faster hash function.
(cherry picked from commit 74b7a6c75e698378882d30958908073407f97fb3)
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 5914771</samp>
This pull request introduces a new function `_group_tensors_by_device_and_dtype` that can group tensors by their device and dtype, and updates the `foreach` utilities and several optimizers to use this function. The goal is to improve the performance, readability, and compatibility of the code that handles tensors with different properties. The pull request also adds a test case and type annotations for the new function, and some error checks for the `fused` argument in Adam and AdamW.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103912
Approved by: https://github.com/janeyx99
Previously, you'd get `<eval_with_key>.0`; now you get `<eval_with_key>.0 from /data/users/ezyang/b/pytorch/test/dynamo/test_misc.py:5683 in forward`
I used to do this with globals, but now I do it with a `co_fields` parameter that's plumbed around, because putting things in globals has implications(TM). Happy to bikeshed on the `co_fields` structure.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103885
Approved by: https://github.com/albanD
Summary: We implement `aten::repeat` on Vulkan backend through `aten::unsqueeze` and `aten::cat`. The behavior of `aten::repeat` is demonstrated here https://pytorch.org/docs/stable/generated/torch.Tensor.repeat.html
Test Plan:
`repeat_invalid_inputs_outputs_exceptions` checks the following:
- if the input tensor has dim <= 4
- if the size of `repeats` is >= input.dim
- if the output tensor has dim <= 4
In `test_repeat` we check the following combinations: input is of dim between 1 and 4 and `repeats` is of size between `input.dim()` and 4. If a test case fails, the shape info is printed, e.g. `Repeat test failed when input is of shape [13, 5, 13] and repeat of [7, 2, 3]`.
```
(base) luwei@luwei-mbp fbsource % buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 -- --gtest_filter="*repeat*"
Building: finished in 0.1 sec (100%) 263/2811 jobs, 0/2811 updated
Total time: 0.1 sec
BUILD SUCCEEDED
Running main() from xplat/third-party/gmock/googletest-1.12.1/googletest/src/gtest_main.cc
Note: Google Test filter = *repeat*
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from VulkanAPITest
[ RUN ] VulkanAPITest.repeat_invalid_inputs_outputs_exceptions
[ OK ] VulkanAPITest.repeat_invalid_inputs_outputs_exceptions (28 ms)
[ RUN ] VulkanAPITest.repeat
[ OK ] VulkanAPITest.repeat (46 ms)
[----------] 2 tests from VulkanAPITest (75 ms total)
[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (75 ms total)
[ PASSED ] 2 tests.
```
Reviewed By: yipjustin
Differential Revision: D46244750
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103255
Approved by: https://github.com/SS-JIA
This reverts commit 03881b0c925f191ec41d6899d589ed420ac285b5.
Reverted https://github.com/pytorch/pytorch/pull/103264 on behalf of https://github.com/osalpekar due to This commits seems to have been causing failures in test_nccl_init_abort. Those failures may have been masked by pre-existing failures in the distributed jobs on trunk when running CI on this PR. Since those breaking changes are now reverted, we should be able to rebase this and get clean signal + uncover the breakages caused by this PR. ([comment](https://github.com/pytorch/pytorch/pull/103264#issuecomment-1599451197))
Summary:
Extending support for the [Softmax function](https://pytorch.org/docs/stable/generated/torch.nn.Softmax.html) on the PyTorch Vulkan GPU backend.
# Before
1. Softmax could only be calculated along dim=1, AKA along channel with NCHW convention
2. Softmax input Vulkan Tensor must have had size 1 along dim=0, AKA batch size of 1 with NCHW convention
3. Softmax input Vulkan Tensor must be 4-dimensional, AKA NCHW
# After
1. Softmax can be calculated along any dim={0,1,2,3}
2. Softmax input Vulkan Tensor can have any size along dim=0
3. Softmax input Vulkan Tensor must be 4-dimensional, AKA NCHW
Test Plan:
1. `buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1` on Apple M1 MacBook
2. Confirm all tests pass with no regression, and the added tests `*softmax*` pass under `-- --gtest_filter="*softmax*"`
2a. All tests P758913494
2b. `softmax` tests P758910449
3. Overview:
```
~/fbsource » buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1
[...]
[ RUN ] VulkanAPITest.softmax_4d
[ OK ] VulkanAPITest.softmax_4d (69 ms)
[...]
[----------] 275 tests from VulkanAPITest (3149 ms total)
[----------] Global test environment tear-down
[==========] 275 tests from 1 test suite ran. (3149 ms total)
[ PASSED ] 274 tests.
[ SKIPPED ] 1 test, listed below:
[ SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
```
Reviewed By: SS-JIA
Differential Revision: D45880611
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102988
Approved by: https://github.com/SS-JIA
This PR fixes https://github.com/pytorch/pytorch/issues/103684.
- Instead of registering forward hooks in `__init__()`, do it upon `__enter__()`.
- De-register those forward hooks upon `__exit__()`.
- Achieve this by saving an additional mapping `_module_to_forward_hook_handles: Dict[nn.Module, _ForwardHookHandles]`. Only the values in the mapping (i.e. not the keys) are useful for this change. (A `List[_ForwardHookHandles]` would suffice.)
- The unit test accesses private attributes `_forward_hooks` and `_forward_pre_hooks` :/
Note that this PR is technically not backward compatible since it does not register the hooks upon `__init__()`, which means that you will not get the flops counting without the context manager.
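A sketch of the register-in-`__enter__`, remove-in-`__exit__` pattern, using a hypothetical `FlopCounterLike` class rather than the actual implementation:
```python
from typing import Dict, List
import torch.nn as nn

class FlopCounterLike:
    def __init__(self, module: nn.Module):
        self.module = module
        self._module_to_forward_hook_handles: Dict[nn.Module, List] = {}

    def __enter__(self):
        # Hooks are only installed while the context manager is active.
        for mod in self.module.modules():
            pre = mod.register_forward_pre_hook(self._pre_hook)
            post = mod.register_forward_hook(self._post_hook)
            self._module_to_forward_hook_handles[mod] = [pre, post]
        return self

    def __exit__(self, *exc):
        for handles in self._module_to_forward_hook_handles.values():
            for handle in handles:
                handle.remove()
        self._module_to_forward_hook_handles.clear()

    def _pre_hook(self, mod, inputs):
        pass  # counting logic elided

    def _post_hook(self, mod, inputs, output):
        pass  # counting logic elided
```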
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103744
Approved by: https://github.com/Chillee
This PR changes the default namespace for higher order operators from the
global namespace (e.g. torch.ops.cond) to `higher_order` (e.g.
torch.ops.higher_order.cond). We don't actually change the namespace
for existing HigherOrderOperators.
The motivation is to stem the bleeding; exposing operators into the global
namespace is a bad idea due to name collision with other user-defined
namespaces.
We will go in and fix the `_deprecated_global_ns` as necessary after this diff.
Differential Revision: [D46809738](https://our.internmc.facebook.com/intern/diff/D46809738/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103870
Approved by: https://github.com/ydwu4
# Summary by author
* Prior to this PR, FX-to-ONNX conversion logic was sprinkled across several functions, files and classes, such as `_export_fx_to_onnx`, `export_fx_to_onnxscript`, `_export_fx_node_to_onnxscript` and the `OnnxDispatcher` class [1]. Although each had its specific role in the lowering of FX, they are all part of the same lowering process.
* A `FxOnnxInterpreter` class, similar to but not derived from `torch.fx.Interpreter`, is introduced to drive the FX Graph -> ONNX Graph process. All functions and utilities from the previous bullet were moved under this class with minor refactoring.
* One of the main changes is that each FX node type now has its own entry point, reducing complexity. It also provides isolation among them.
* Why refactor as a class and not as a bunch of functions? The ONNX Exporter has adopted an object-oriented paradigm since its origin, so this refactoring should not be seen as a break of paradigm. This is just a continuation of a previous design decision. Examples of other classes include `Exporter`, `ExportOptions`, `ExportOutput`, `ExportOutputSerializer`, `ProtobufExportOutputSerializer`, `FXGraphExtractor`, `ResolvedExportOptions`, `Analysis`, `Diagnostic`, `DiagnosticContext`, `Decompose`, `Functionalize`, `MovePlaceholderToFront`, `RemoveInputMutation`, `ReplaceGetAttrWithPlaceholder`, `ShapeInferenceWithFakeTensor`, `OnnxRegistry`, `OnnxDispatcher`, just to name a few.
* `torch.fx.Interpreter` was not used because its API only passes the node name (aka `target`) instead of the actual `torch.fx.Node` object to the node implementations. This is not sufficient as the ONNX conversion process needs to inspect the node to extract type, name and other info from the node.
* This PR renames `OnnxDispatcher` (without functionality changes) to `OnnxFunctionDispatcher` for clarity. The word ONNX was too overloaded in this context.
* This PR also moved the `passes` and `serialization` handling from the `_export_fx_to_onnx` util to `Exporter.export`, where they are consumed. Passes are not the goal of this PR, so they were moved to a temporary function called `pre_export_passes` (mainly the content of `_export_fx_to_onnx` without serialization and the fx -> onnx call).
* This PR also fixed a bug in which a new registry (and dispatcher, which would not be a problem) was created for each pass. That would be an issue with custom operators because only the original registry would have a reference to the custom operator.
Below is a summarized structure of the export process:
```python
class Exporter
def export(self) -> ExportOutput:
# 1) Trace torch.nn.Module into torch.fx.GraphModule
graph_module = self.options.fx_tracer.generate_fx()
# 2) Adapt input and output types to match ONNX graph
updated_model_args = self.options.fx_tracer.input_adapter.apply()
# 3) Run pre-export passes
graph_module = pre_export_passes()
# 4) Dispatch each FX node to an ONNX operator implementation
# Model level FX -> ONNX.
fx_interpreter = fx_onnx_interpreter.FxOnnxInterpreter()
fx_interpreter.run()
# 5) Serialize graph to ONNX ModelProto.
onnx_model = onnxscript_graph.to_model_proto(self.options.opset_version)
# 6 Create ExportOutput
return torch.onnx.ExportOutput()
class FxOnnxInterpreter: # NOT a torch.fx.Interpreter
def run(self, node: torch.fx.Node):
with torch.utils._mode_utils.no_dispatch():
for node in self.graph_module.graph.nodes:
run_node(node)
def run_node(node):
if node.op == "placeholder":
self.placeholder(node)
elif node.op == "get_attr":
self.get_attr(node)
elif node.op == "call_function":
self.call_function(node)
elif node.op == "call_method":
self.call_method(node)
elif node.op == "call_module":
self.call_module(node)
elif node.op == "output":
self.output(node)
else:
raise RuntimeError(
f"Found node type not defined in torch.fx: {node.op}"
)
def placeholder(self, node: torch.fx.Node):
pass
def call_function(self, node: torch.fx.Node):
pass
def output(self, node: torch.fx.Node):
pass
def call_method(self, node: torch.fx.Node):
pass
def call_module(self, node: torch.fx.Node):
pass
def get_attr(self, node: torch.fx.Node):
pass
class OnnxFunctionDispatcher:
def dispatch(
self,
node: torch.fx.Node,
onnx_args: Sequence[Optional[Union[_TensorLike, str, int, float, bool, list]]],
onnx_kwargs: Dict[str, _type_utils.Argument],
diagnostic_context: diagnostics.DiagnosticContext,
) -> Union["onnxscript.OnnxFunction", "onnxscript.TracedOnnxFunction"]:
pass
def get_aten_name( # promoted to public API
self, node: torch.fx.Node, diagnostic_context: diagnostics.DiagnosticContext
) -> str:
pass
def get_function_overloads( # promoted to public API
self,
node: torch.fx.Node,
aten_name: str,
diagnostic_context: diagnostics.DiagnosticContext,
) -> Set[Union["onnxscript.OnnxFunction", "onnxscript.TracedOnnxFunction"]]:
pass
```
Before this PR, that was the structure of the code:
```python
# torch/onnx/_internal/exporter.py
class Exporter:
def export(self) -> ExportOutput:
graph_module = self.options.fx_tracer.generate_fx(
self.options, self.model, self.model_args, self.model_kwargs
)
updated_model_args = self.options.fx_tracer.input_adapter.apply(
*self.model_args, **self.model_kwargs
)
return self.options.fx_tracer._export_fx_to_onnx(
self.options, graph_module, updated_model_args
)
# torch/onnx/_internal/exporter.py
class FXGraphExtractor
def _export_fx_to_onnx() -> ExportOutput: `# Ignore the fact it lives inside FXGraphExtractor. It was a temporary thing
# Run all passes
# ...
with torch.utils._mode_utils.no_dispatch():
onnxscript_graph = passes.export_fx_to_onnxscript()
# Run input adapter
# ...
# Run output adapter
# ...
# Export TorchScript graph to ONNX ModelProto.
onnx_model = onnxscript_graph.to_model_proto(options.opset_version)
# Create ExportOutput
return torch.onnx.ExportOutput()
# torch/onnx/_internal/fx/passes/fx_to_onnxscript.py
def export_fx_to_onnxscript():
# Initialize the ONNX graph
onnxscript_graph = graph_building.TorchScriptGraph()
tracer = graph_building.TorchScriptTracingEvaluator(onnxscript_graph)
for node in fx_module_with_metadata.graph.nodes:
_export_fx_node_to_onnxscript()
return onnxscript_graph
# torch/onnx/_internal/fx/passes/fx_to_onnxscript.py
def _export_fx_node_to_onnxscript():
if node.op == "placeholder":
# ...
elif node.op == "call_function":
symbolic_fn = options.onnx_dispatcher.dispatch()
with evaluator.default_as(tracer):
output = symbolic_fn(*onnx_args, **onnx_kwargs)
elif node.op == "output":
# ...
elif node.op == "call_method":
# ...
elif node.op == "call_module":
# ...
elif node.op == "get_attr":
# ...
else:
raise RuntimeError(f"Found node type not defined in torch.fx: {node.op}")
# torch/onnx/_internal/fx/function_dispatcher.py
class OnnxDispatcher:
@_beartype.beartype
def dispatch() -> Union["onnxscript.OnnxFunction", "onnxscript.TracedOnnxFunction"]:
# ONNX Script lookup only
```
[1]
Note that the main functionality of the fx -> onnx conversion is orchestrated by functions in different files (see below).
Although the main loop that drives the dispatching is executed by a well-defined function (`export_fx_to_onnxscript`), this is not the entry point of the export process. The entry point is a utility function called `_export_fx_to_onnx`, which calls `export_fx_to_onnxscript`, which in turn calls another utility called `_export_fx_node_to_onnxscript`. Lastly, `_export_fx_node_to_onnxscript` implements *all* FX nodes in a huge monolithic block. The "call_function" segment of that block consumes `OnnxDispatcher`, completing the cycle.
```bash
_export_fx_to_onnx torch/onnx/_internal/exporter.py
_export_fx_node_to_onnxscript torch/onnx/_internal/fx/fx_to_onnxscript.py
export_fx_to_onnxscript torch/onnx/_internal/fx/fx_to_onnxscript.py
OnnxDispatcher torch/onnx/_internal/fx/onnxfunction_dispatcher.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102810
Approved by: https://github.com/wschin, https://github.com/BowenBao
- Added ops that were missing under `__all__`.
- Some misc changes to helper functions to make them private.
- Set correct `fn.__module__` for `fn` created by `_make_alias`, when called in another module.
All modifications largely reference results from a hacked version of `test_public_bindings::test_correct_module_names`.
By default `torch._refs` is not included in the test because it is technically a private package.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103712
Approved by: https://github.com/lezcano
The current behaviour for dynamo is to set the dtype to torch.int64 for integral types if the dtype is not specified explicitly, which results in behaviour that does not match eager mode. In eager mode the semantics are (sketched in code after this list):
- If both out is specified and dtype is specified then they have to match
- If dtype is not specified but out is specified then the dtype is set to match the out dtype
- If neither dtype nor out is set then the dtype is set to kLong if it is a bool or an integral type
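A sketch of that eager rule, with hypothetical `dtype`/`out` parameters standing in for the op's actual arguments:
```python
import torch

_INTEGRAL_DTYPES = {
    torch.bool, torch.uint8, torch.int8, torch.int16, torch.int32, torch.int64,
}

def resolve_result_dtype(input_dtype, dtype=None, out=None):
    if out is not None and dtype is not None:
        assert out.dtype == dtype, "explicit dtype must match out's dtype"
        return dtype
    if out is not None:
        return out.dtype
    if dtype is not None:
        return dtype
    # Neither dtype nor out: bool/integral inputs are promoted to int64 (kLong).
    if input_dtype in _INTEGRAL_DTYPES:
        return torch.int64
    return input_dtype
```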
Fixes #100698
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103037
Approved by: https://github.com/ngimel
**Summary**
- Update the quantization document that default qconfig with oneDNN backend is recommended to be used on CPUs with Vector Neural Network Instruction support.
- Add the warning message when user uses default qconfig with oneDNN backend on CPU without Vector Neural Network Instruction support.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103653
Approved by: https://github.com/jgong5, https://github.com/malfet
Summary: The original `cat_feature_mult4ch` assumes input tensors are 4D and uses `tensor.sizes()[1]` to obtain the channel info of the tensor. This causes bugs when the input tensors are 3D. We generalize `cat_feature_mult4ch` to cover both 3D and 4D.
Test Plan:
The test for 3D tensors with channels as a multiple of 4 is shown below. Full test results are in P771032677.
```
(base) luwei@luwei-mbp fbsource % buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 -- --gtest_filter="*cat_3d_dim0_mult4ch_success*"
Building: finished in 0.1 sec (100%) 263/2812 jobs, 0/2812 updated
Total time: 0.1 sec
BUILD SUCCEEDED
Running main() from xplat/third-party/gmock/googletest-1.12.1/googletest/src/gtest_main.cc
Note: Google Test filter = *cat_3d_dim0_mult4ch_success*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from VulkanAPITest
[ RUN ] VulkanAPITest.cat_3d_dim0_mult4ch_success
[ OK ] VulkanAPITest.cat_3d_dim0_mult4ch_success (129 ms)
[----------] 1 test from VulkanAPITest (129 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (129 ms total)
[ PASSED ] 1 test.
```
Reviewed By: SS-JIA
Differential Revision: D46755034
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103718
Approved by: https://github.com/SS-JIA
This test (8340762211/test/distributed/test_multi_threaded_pg.py (L133)) is failing on the internal sandbox with the following error message:
```
File "/data/sandcastle/boxes/eden-trunk-hg-fbcode-fbsource/buck-out/v2/gen/fbcode/8c7462494077df89/caffe2/test/distributed/__multi_threaded__/multi_threaded#link-tree/torch/testing/_internal/distributed/multi_threaded_pg.py", line 255, in _start_coll
raise Exception(
Exception: world not ready, only 3 PG's registered but world has 4 ranks
exiting thread 1
ERROR
```
Internal error report: https://www.internalfb.com/intern/test/562950031915334?ref_report_id=0
We believe this is because we no longer perform barrier after init (see https://github.com/pytorch/pytorch/pull/99937).
This PR temporarily turns ```TORCH_DIST_INIT_BARRIER``` back on to avoid the flaky test for the time being, but we should look into finding a way to do this properly.
cc. @kumpera @kwen2501
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103568
Approved by: https://github.com/H-Huang
Adds a Constant Folding pass to the joint graph only targeting tensors which can be replaced with a single value, and then removes no-ops from the graph. This allows us to match sdpa in BertForMaskedLM, AlbertForMaskedLM, and LayoutLMForMaskedLM.
BertForMaskedLM
Perf: 1.6853 -> 1.933, Memory: 0.9462 -> 1.41
AlbertForMaskedLM
Perf: 1.6620 -> 1.761, Memory: 1.257 -> 1.94
LayoutLMForMaskedLM
Perf: (non cudagraphs) 1.6991 -> 1.939x, Memory: 0.9624 -> 1.50
MobileBertForMaskedLM
Perf: 1.864x -> 1.941x, Memory: 0.94 -> 1.03
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103600
Approved by: https://github.com/jansel
Before this PR, when compiling a function whose signature takes symint/symintlist/intlist, we got a runtime error like ```argument 'shifts' must be tuple of ints, not FakeTensor```. See the newly added unit test in test/dynamo/test_misc.py for a repro.
After this PR, for a FakeTensor with empty size and numel()=1, we try
to convert it into a symint/symintlist. We will likely see the expected
exception
```torch._subclasses.fake_tensor.DataDependentOutputException / aten._local_scalar_dense.default``` during conversion.
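Illustrative only (not dynamo's actual conversion path), the coercion amounts to something like this; under FakeTensorMode the `item()` call is what hits `aten._local_scalar_dense` and can raise the exception above:
```python
import torch

def coerce_scalar_tensor(arg):
    # A 0-dim tensor with a single element can stand in for an int/SymInt argument.
    if isinstance(arg, torch.Tensor) and arg.dim() == 0 and arg.numel() == 1:
        return arg.item()
    return arg

print(coerce_scalar_tensor(torch.tensor(3)))  # -> 3
```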
Reference PRs:
* we handle FakeTensor for symintlist as 1st varags: https://github.com/pytorch/pytorch/pull/97508
* we handle FakeTensor for intlist in a similar way:
https://github.com/pytorch/pytorch/pull/85759/files
* call local_scalar_dense on a FakeTensor:
f7365eca90
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103448
Approved by: https://github.com/yanboliang
Hi! I found heap-buffer-overflow during PyTorch RPC-module fuzzing.
[crash-9cc26b8da3b688a9c26614481239943b357c5636.zip](https://github.com/pytorch/pytorch/files/11707706/crash-9cc26b8da3b688a9c26614481239943b357c5636.zip)
```
"==10634==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x6060001b6a98 at pc 0x000000639a2e bp 0x7fffffff9100 sp 0x7fffffff90f8",
"READ of size 4 at 0x6060001b6a98 thread T0",
" #0 0x639a2d in c10::IValue::isTensor() const /pytorch/aten/src/ATen/core/ivalue.h:432:27",
" #1 0x639a2d in c10::IValue::toTensor() && /pytorch/aten/src/ATen/core/ivalue_inl.h:159:7",
" #2 0xc5eb105 in at::Tensor c10::IValue::to<at::Tensor>() && /pytorch/aten/src/ATen/core/ivalue_inl.h:1690:1",
" #3 0xc5eb105 in void torch::jit::pop<at::Tensor>(std::vector<c10::IValue, std::allocator<c10::IValue> >&, at::Tensor&) /pytorch/aten/src/ATen/core/stack.h:130:55",
" #4 0xc5eaedb in torch::jit::dtype(std::vector<c10::IValue, std::allocator<c10::IValue> >&) /pytorch/torch/csrc/jit/mobile/promoted_prim_ops.cpp:105:3",
" #5 0xcc79600 in torch::jit::InterpreterStateImpl::runImpl(std::vector<c10::IValue, std::allocator<c10::IValue> >&) /pytorch/torch/csrc/jit/runtime/interpreter.cpp:682:13",
" #6 0xcc4158b in torch::jit::InterpreterStateImpl::run(std::vector<c10::IValue, std::allocator<c10::IValue> >&) /pytorch/torch/csrc/jit/runtime/interpreter.cpp:1052:9",
" #7 0x60f378 in runGraph(std::shared_ptr<torch::jit::Graph>, std::vector<at::Tensor, std::allocator<at::Tensor> > const&) /jit_differential.cc:66:38",
" #8 0x610bb9 in LLVMFuzzerTestOneInput /jit_differential.cc:107:25",
" #9 0x535c91 in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:611:15",
" #10 0x51fb9c in fuzzer::RunOneTest(fuzzer::Fuzzer*, char const*, unsigned long) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:324:6",
" #11 0x5258eb in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:860:9",
" #12 0x54eea2 in main /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerMain.cpp:20:10",
" #13 0x7ffff7a37082 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24082) (BuildId: 1878e6b475720c7c51969e69ab2d276fae6d1dee)",
" #14 0x51a4bd in _start (/jit_differential_fuzz+0x51a4bd)",
"",
"0x6060001b6a98 is located 8 bytes to the left of 64-byte region [0x6060001b6aa0,0x6060001b6ae0)",
"allocated by thread T0 here:",
" #0 0x60c66d in operator new(unsigned long) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/asan/asan_new_delete.cpp:95:3",
" #1 0xa5a41b in std::_Vector_base<c10::IValue, std::allocator<c10::IValue> >::_M_allocate(unsigned long) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/stl_vector.h:346:20",
" #2 0xa5a41b in void std::vector<c10::IValue, std::allocator<c10::IValue> >::_M_realloc_insert<c10::IValue&>(__gnu_cxx::__normal_iterator<c10::IValue*, std::vector<c10::IValue, std::allocator<c10::IValue> > >, c10::IValue&) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/vector.tcc:440:33",
" #3 0xa5a241 in c10::IValue& std::vector<c10::IValue, std::allocator<c10::IValue> >::emplace_back<c10::IValue&>(c10::IValue&) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/vector.tcc:121:4",
" #4 0xcc8209c in torch::jit::InterpreterStateImpl::runImpl(std::vector<c10::IValue, std::allocator<c10::IValue> >&) /pytorch/torch/csrc/jit/runtime/interpreter.cpp:345:19",
" #5 0xcc4158b in torch::jit::InterpreterStateImpl::run(std::vector<c10::IValue, std::allocator<c10::IValue> >&) /pytorch/torch/csrc/jit/runtime/interpreter.cpp:1052:9",
" #6 0x60f378 in runGraph(std::shared_ptr<torch::jit::Graph>, std::vector<at::Tensor, std::allocator<at::Tensor> > const&) /jit_differential.cc:66:38",
" #7 0x610bb9 in LLVMFuzzerTestOneInput /jit_differential.cc:107:25",
" #8 0x535c91 in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:611:15",
" #9 0x51fb9c in fuzzer::RunOneTest(fuzzer::Fuzzer*, char const*, unsigned long) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:324:6",
" #10 0x5258eb in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:860:9",
" #11 0x54eea2 in main /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerMain.cpp:20:10",
" #12 0x7ffff7a37082 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24082) (BuildId: 1878e6b475720c7c51969e69ab2d276fae6d1dee)",
"",
"SUMMARY: AddressSanitizer: heap-buffer-overflow /pytorch/aten/src/ATen/core/ivalue.h:432:27 in c10::IValue::isTensor() const",
"Shadow bytes around the buggy address:",
" 0x0c0c8002ed00: 00 00 00 00 00 00 00 fa fa fa fa fa fd fd fd fd",
" 0x0c0c8002ed10: fd fd fd fd fa fa fa fa fd fd fd fd fd fd fd fd",
" 0x0c0c8002ed20: fa fa fa fa fd fd fd fd fd fd fd fd fa fa fa fa",
" 0x0c0c8002ed30: fd fd fd fd fd fd fd fd fa fa fa fa 00 00 00 00",
" 0x0c0c8002ed40: 00 00 00 00 fa fa fa fa fd fd fd fd fd fd fd fd",
"=>0x0c0c8002ed50: fa fa fa[fa]00 00 00 00 00 00 00 00 fa fa fa fa",
" 0x0c0c8002ed60: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa",
" 0x0c0c8002ed70: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa",
" 0x0c0c8002ed80: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa",
" 0x0c0c8002ed90: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa",
" 0x0c0c8002eda0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa",
"Shadow byte legend (one shadow byte represents 8 application bytes):",
" Addressable: 00",
" Partially addressable: 01 02 03 04 05 06 07",
" Heap left redzone: fa",
" Freed heap region: fd",
" Stack left redzone: f1",
" Stack mid redzone: f2",
" Stack right redzone: f3",
" Stack after return: f5",
" Stack use after scope: f8",
" Global redzone: f9",
" Global init order: f6",
" Poisoned by user: f7",
" Container overflow: fc",
" Array cookie: ac",
" Intra object redzone: bb",
" ASan internal: fe",
" Left alloca redzone: ca",
" Right alloca redzone: cb",
"==10634==ABORTING"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103327
Approved by: https://github.com/Skylion007
Fixes#102768
- Provides proper function declarations in generated `torch/nn/functional.pyi`.
- Moves some functions from manually defined in `functional.pyi.in` to generated code, in order to single-source the signature.
- Includes some of the functions in `torch._C._nn` into its `.pyi.in`, but not exhaustive (only what's already there).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102918
Approved by: https://github.com/drisspg, https://github.com/malfet
In our DDP training workloads, each rank was initializing a `RandomSampler` for a dataset with a length of 3.5 billion items. We noticed that when this sampler was in scope, `gc.collect` calls were taking on the order of seconds to run, which would slow down the entire training iteration. This is because when we call `torch.randperm(n).tolist()`, we create a python list of 3.5 billion items, which massively slows down the periodic mark & sweep garbage collection.
This PR swaps out the `.tolist()` call with a `.numpy()` call and manually calls `.item()` on each element as it is being requested. This has two benefits:
1. The first call to `RandomSampler::__next__` should be about twice as fast, since `.numpy` does not copy the contents of the original tensor
2. The runtime of `gc.collect()` calls no longer scales linearly with the size of the dataset passed to `RandomSampler`
I've attached some `timeit` samples to illustrate the speedups with this PR:
```
Main (no GC): 51.72115747816861
Main (10 GC calls) 83.61965207383037
PR (no GC) 33.06403830461204
PR (10 GC calls) 33.959467427805066
```
Code
```python
from timeit import timeit
baseline_no_gc = """
import torch
n = int(1e9)
steps = n // 100
x = torch.randperm(n).tolist()
x_iter = iter(x)
for i in range(steps):
next(x_iter)
"""
baseline_gc = """
import torch
import gc
n = int(1e9)
steps = n // 100
gc_every = steps // 10
x = torch.randperm(n).tolist()
x_iter = iter(x)
for i in range(steps):
next(x_iter)
if i % gc_every == 0:
gc.collect()
"""
numpy_no_gc = """
import torch
n = int(1e9)
steps = n // 100
x = torch.randperm(n).numpy()
x_iter = (i.item() for i in x)
for i in range(steps):
next(x_iter)
"""
numpy_gc = """
import torch
import gc
n = int(1e9)
steps = n // 100
gc_every = steps // 10
x = torch.randperm(n).numpy()
x_iter = (i.item() for i in x)
for i in range(steps):
next(x_iter)
if i % gc_every == 0:
gc.collect()
"""
if __name__ == "__main__":
print("Main (no GC): ", timeit(baseline_no_gc, number=1))
print("Main (10 GC calls)", timeit(baseline_gc, number=1))
print("PR (no GC)", timeit(numpy_no_gc, number=1))
print("PR (10 GC calls)", timeit(numpy_gc, number=1))
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103339
Approved by: https://github.com/kit1980
Summary:
This diff allows the `TCPStore` server associated with a gloo process group to listen on an existing socket already bound to a port.
Without the functionality in this diff, canonical initialization of a gloo `ProcessGroup` is fundamentally racy: 1) ask the OS for a free port by creating a socket bound to port 0, 2) close the socket, 3) attempt to initialize a `TCPStore` server that listens on the previously free port. Of course, the problem is that in between steps 2 and 3, another process on the host may have claimed the port, causing `TCPStore` and overall process group initialization to fail. With this diff, it is now possible for users to completely avoid this race (see unit test for how this can be achieved).
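A plain-socket sketch of the race (not the TCPStore API itself): the window between closing the probe socket and binding the store is exactly what handing over an already-bound, listening socket removes:
```python
import socket

def racy_free_port() -> int:
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("127.0.0.1", 0))
    port = s.getsockname()[1]
    s.close()  # window opens here: another process can grab the port
    return port

def race_free_listener() -> socket.socket:
    # Keep the bound, listening socket alive and hand it to the server instead
    # of the port number; there is no window for another process to steal it.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("127.0.0.1", 0))
    s.listen()
    return s
```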
Test Plan:
Added new unit test:
buck2 test caffe2/test/distributed:store
Differential Revision: D46622317
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103478
Approved by: https://github.com/H-Huang
Summary:
Similar to the prepare case, we need to manually copy
over literal conv args such as padding and stride to the new,
replaced conv nodes, since these args are not captured by the
subgraph rewriter.
Test Plan: python test/test_quantization.py TestQuantizePT2E.test_qat_conv_bn_fusion_literal_args
Reviewed By: jerryzh168
Differential Revision: D46383130
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103731
Approved by: https://github.com/jerryzh168
This PR adds dedicated FakeTensor testing to operator_compile_check. We
reuse CrossRefFakeMode to do this and improve the error messages on it.
Note that this only really runs detailed tests for operators that do not
have data-dependent output shape. In the future we should add something
like a dynamic CrossRefFakeMode.
Test Plan:
- existing tests (these now have improved error messages).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103595
Approved by: https://github.com/ezyang, https://github.com/soulitzer
Fixes #102678, #102629, #102558
hipSOLVER performance on ROCm 5.4.2 and later no longer serves as a massive bottleneck. Additionally, using MAGMA on ROCm in this case caused test_compare_cpu_linalg_pinv_singular_cuda_float32 to fail. Using hipSOLVER, the test now passes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103540
Approved by: https://github.com/lezcano
At a high level, the current implementation of the constraint functions (constrain_as_**) raises an exception for the following code snippet:
```
def f(x):
a = x.item()
constrain_as_size(a, 4, 7)
return torch.empty((a, 4))
inp = torch.tensor([5])
ep = torch._export.export(f, (inp,))
```
The reason is that the current constraint logic:
1) Is purely Python, so it won't survive AOT export (the full node is gone after AOT export since AOT export only maintains aten-level ops).
2) Utilizes a side effect to add range constraints to the traced symbol's shape env ([code](9591e52880/torch/fx/experimental/symbolic_shapes.py (L370-L372))).
3) If runtime assertions are turned on (the default), [`_AddRuntimeAssertionsForConstraintsPass`](9591e52880/torch/_export/passes/add_runtime_assertions_for_constraints_pass.py (L98-L100)) will try to append assertion nodes based on the range constraints extracted from the symbol's shape env during another interpretation round.
4) However, because of 1), the range-constraint logic won't run for symbols generated during the AOT export round, so no range-constraint information is available later for the assertion round, which causes the issue.
5) As a result of the above, it fails at `torch.empty((a, 4))` (there is no constraint that `a` must be positive).
The fix here is to implement the range-constraint logic as a native aten op (with the CPU implementation as a no-op) so that it can survive AOT export.
**NOTE:**
[Logic](2d745b95d7/torch/fx/experimental/symbolic_shapes.py (L350-L365C15)) within [`constrain_range`](2d745b95d7/torch/fx/experimental/symbolic_shapes.py (LL313C74-L313C74)) is split out as `constrain_range_int` to capture the case when a non-`SymInt` is passed in, and it is reused in the new `_constrain_range`. The reason is that when a non-`SymInt` is provided:
* If it directly calls `sym_constrain_range`, the C++ version will be called, which is a no-op.
* So in this case it calls `constrain_range_int` instead, to catch issues like the user providing an input whose tensor shape is out of range during export, like the following for the above code example:
```
...
inp = torch.tensor([10])
ep = torch._export.export(f, (inp,)) # immediately raise error
```
Differential Revision: [D46734204](https://our.internmc.facebook.com/intern/diff/D46734204)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103346
Approved by: https://github.com/tugsbayasgalan
It turns out that jsdelivr, which is used to access the MemoryViz.js
source from generated files, doesn't work unless a version is specified.
This wasn't able to be tested until the PR actually landed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103741
Approved by: https://github.com/aaronenyeshi
This replaces the individual visualization routines in _memory_viz.py with
a single javascript application.
The javascript application can load pickled snapshot dumps directly using
drag/drop, requesting them via fetch, or by embedding them in a webpage.
The _memory_viz.py commands use the embedding approach.
We can also host MemoryViz.js on a webpage to use the drag/drop approach, e.g.
https://zdevito.github.io/assets/viz/
(eventually this should be hosted with the pytorch docs).
All views/multiple cuda devices are supported on one page.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103565
Approved by: https://github.com/eellison, https://github.com/albanD
This bandaid fixes yolov3 with automatic_dynamic_shapes.
A more proper fix probably is to figure out why when we
have
```
TypeError: mkldnn_reorder_conv2d_weight(): argument 'input_size' (position 6) must be tuple of ints, but found element of type SymInt at pos 1
```
where the SymInt is known to be constant, we aren't willing to
coerce it to int.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103677
Approved by: https://github.com/voznesenskym
Summary:
Previously, the QAT pattern for conv + bn with no conv
bias was not actually replaced in convert. This commit adds an
extra pattern in the convert path for this case and the numerics
now match FX's.
Test Plan: python test/test_quantization.py TestQuantizePT2E.test_prepare_qat_conv_bn_fusion_no_conv_bias
Reviewed By: jerryzh168
Differential Revision: D46382819
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103298
Approved by: https://github.com/jerryzh168
Signed-off-by: Mike Brown <brownwm@us.ibm.com>
To avoid issues for new contributors when building master, a couple of README.md comments will help. This change:
~~- Documents the current support restriction to apt package `g++-9` #91328 ** noting here that with the commit in https://github.com/pytorch/pytorch/pull/92911 g++-11.3 appears to build and run local tests at least as well as g++9, so this restriction may be overcome with that PR merge depending on success and CI updates.~~ (fixed now)
- Documents wip status for CUDA 12 #91122 (by forwarding to support matrix per suggestion)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92729
Approved by: https://github.com/kit1980
https://github.com/pytorch/pytorch/pull/95715 added the functionality to abort `ncclCommInitRankConfig` by specifying `blocking=0` to enable non-blocking behavior.
However, calling the `pg._abort()` didn't recover from a stuck `ncclCommInitRankConfig` since the `_abort` method only looked through `devNCCLCommMap_` map and aborted those communicators. Since `ncclCommInitRankConfig` was stuck, the communicator itself wasn't added to the map and the host thread was stuck on this line: https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L1171. As a result, `_abort` was a no-op.
To resolve this issue, I added the communicators to `inProgressCommMap_` as soon as they were created and then removed them once added to `devNCCLCommMap_`.
I also added a unit test that was failing without the changes to ProcessGroupNCCL.cpp
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103264
Approved by: https://github.com/kwen2501
It turns out that we need to fix https://github.com/pytorch/pytorch/issues/103656 in the coordinate descent tuner.
Inductor generates Triton code with an assumption about the max block size. If Inductor is sure that numel is a multiple of the max block size, it will safely skip the corresponding mask check for perf reasons.
The coordinate descent tuner previously did not respect this assumption and might pick a Triton config with an even larger block size. That causes an illegal memory access (IMA).
BTW, I was wondering how we pick those max block sizes. Not enforcing a max block size may allow the coordinate descent tuner to find an even better config, but it may slow down other cases a bit because of the extra mask check.
Test:
```
TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 TORCHINDUCTOR_MAX_AUTOTUNE_POINTWISE=1 TORCHINDUCTOR_BENCHMARK_KERNEL=1 python benchmarks/dynamo/torchbench.py --amp --performance --inference --inductor --only alexnet
```
Fail before and works after.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103660
Approved by: https://github.com/spectrometerHBH, https://github.com/jansel
Fixes #103606
I was using this script to exercise the new code, because I can never remember which test it is.
```
import torch
@torch.compile(fullgraph=True, dynamic=True)
def shift_right(tensor: torch.Tensor) -> torch.Tensor:
return (tensor >> 2).to(torch.long)
def main():
sample_input = torch.tensor([4, 4, 16, 32], dtype=torch.uint8)
print(shift_right(sample_input))
if __name__ == "__main__":
main()
```
And iterated through the error messages
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103637
Approved by: https://github.com/ezyang
Fixes #95900
Using the following repro as guide:
```python
import torch
import torch._dynamo
from torch._subclasses import fake_tensor
from torch.fx.experimental.symbolic_shapes import ShapeEnv
from torch._dynamo.output_graph import config
class Model(torch.nn.Module):
def __init__(self) -> None:
super().__init__()
self.linear = torch.nn.Linear(2, 2)
self.linear2 = torch.nn.Linear(2, 2)
def forward(self, x):
out = self.linear(x)
out = self.linear2(out)
return out
fake_mode = fake_tensor.FakeTensorMode(allow_non_fake_inputs=False,
allow_fallback_kernels=True,
shape_env=ShapeEnv(
allow_scalar_outputs=config.capture_scalar_outputs,
allow_dynamic_output_shape_ops=config.capture_dynamic_output_shape_ops,
frame_id=0
),
)
# Fakefying input/model before calling torch._dynamo.export
with fake_mode:
fake_x = torch.rand(5, 2, 2)
model = Model()
# Calling torch._dynamo.export without active fake mode
graph_module, guards = torch._dynamo.export(
model,
fake_x,
aten_graph=True,
fake_mode=fake_mode
)
graph_module.print_readable()
graph_module.graph.print_tabular()
```
Summary of changes:
* Plumb fake_mode through the torch.export API. When specified, it
replaces the creation of a new FakeTensorMode at InstructionTranslator on behalf of OutputGraph.
* Hack FakeTensor.__new__ to prevent a
torch.tensor._make_subclass call for inputs that are already fakefied by the
user. This probably needs to be fixed in a nicer way. Any idea?
* Removed a few asserts that didn't want faked tensors coming
from the user script.
* Added torch._subclasses.fake_tensor.FakeTensor to the type list on a few
assert checks to allow fake inputs.
The changes above allowed symbolic tracing with both static and dynamic shapes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100017
Approved by: https://github.com/ezyang
Probably introduced by https://github.com/pytorch/pytorch/pull/102254
This fixes a `variable 'dim_plane' set but not used` warning; my clang-14.0.3 compiler complained about it:
```
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/ReflectionPad.cpp:272:7: error: variable 'dim_plane' set but not used [-Werror,-Wunused-but-set-variable]
int dim_plane = 0;
^
1 error generated.
```
<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at e254b4b</samp>
> _`dim_plane` is gone_
> _Simpler code, no more warning_
> _Autumn leaves fall fast_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103680
Approved by: https://github.com/kit1980, https://github.com/Skylion007
Map of #101157.
This PR adds support for coalesced `reduce_scatter_tensor` calls in the following syntax:
Sync communication style:
```
with dist._coalescing_manager():
for i in range(num_coll):
dist.reduce_scatter_tensor(output_tensors[i], input_tensors[i])
```
Async communication style:
```
with dist._coalescing_manager(async_ops=True) as cm:
for i in range(num_coll):
dist.reduce_scatter_tensor(output_tensors[i], input_tensors[i])
# do a bunch of other things
cm.wait()
# do things that depend on the reduce-scatters' results
```
Each `reduce_scatter_tensor` call can be independent in terms of their data and buffer locations. But could be executed in parallel by supported backends (like NCCL).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103561
Approved by: https://github.com/fegin
Fixes#64601 and #98906
Adds an `assign` argument to `load_state_dict` that loads params/buffers by assignment instead of doing `param.copy_(param_from_state_dict)`.
Primarily intended to remove the need for the `.to_empty()` in
```
with torch.device('meta'):
m = SomeModule()
m.to_empty()
state_dict = torch.load('...pth')
m.load_state_dict(state_dict)
```
so we can instead do
```
with torch.device('meta'):
m = SomeModule()
state_dict = torch.load('...pth')
m.load_state_dict(state_dict, assign=True)
```
**A problem with this PR, for the case where the model is initialized on meta, is: what happens to non-persistent buffers/params corresponding to keys missing from the state dict?**
What happens in the case where `load_state_dict(state_dict, strict=False, assign=True)` and the state_dict is missing some keys? The corresponding params missing from the `state_dict` and nonpersistent buffers would still be on `meta` and need to be manually initialized. However, I don't think we offer an API that would initialize these.
One solution would be to make these empty tensors but it might not be semantically correct...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102212
Approved by: https://github.com/albanD
Windows Defender will soon be removed from the AMI. Without the service, the step fails with the following error:
```
Set-MpPreference : Invalid class
At C:\actions-runner\_work\_temp\1f029685-bb66-496d-beb8-19268ecbe44a.ps1:5 char:1
+ Set-MpPreference -DisableRealtimeMonitoring $True
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : MetadataError: (MSFT_MpPreference:root\Microsoft\...FT_MpPreference) [Set-MpPreference],
CimException
+ FullyQualifiedErrorId : HRESULT 0x80041010,Set-MpPreference
```
For example, https://github.com/pytorch/pytorch-canary/actions/runs/5267043497/jobs/9521809176. This is expected as the service is completely removed.
Here are all the places where `Set-MpPreference` is used according to https://github.com/search?type=code&q=org%3Apytorch+Set-MpPreference
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103454
Approved by: https://github.com/atalman
Summary: Ensure that we create deterministic zip archives for the same inputs to make builds deterministic.
Test Plan: CI
Reviewed By: StanislavGlebik
Differential Revision: D46417033
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102903
Approved by: https://github.com/malfet
`Dirichlet.log_prob()` incorrectly returns NaN in the case where $x_i=0$ and $\alpha_i=1$. The Dirichlet PDF is given by:
$$\frac{1}{B(\alpha)} \prod_{i=1}^{K} x_i^{\alpha_i - 1}$$
So this corresponds to the case where one of the terms has the form $0^0=1$. The logarithm of such a term should be 0, but you get NaN if you try to calculate it as `0 * log(0)`.
This PR implements the same algorithm that `scipy.stats.dirichlet` uses to avoid this behavior, namely `xlogy(alpha - 1, x)` instead of `(alpha - 1) * log(x)`. It also adds a test case comparing the pytorch and scipy implementations for this specific case.
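A minimal numerical illustration of the difference:
```python
import torch

x = torch.tensor([0.0, 0.3, 0.7])
alpha = torch.tensor([1.0, 2.0, 2.0])

naive = (alpha - 1) * torch.log(x)  # first term is 0 * log(0) -> nan
fixed = torch.xlogy(alpha - 1, x)   # xlogy defines this term as 0

print(naive.sum())  # nan
print(fixed.sum())  # finite
```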
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103605
Approved by: https://github.com/albanD
After https://github.com/pytorch/pytorch/pull/102562, the `IMAGE_NAME` input to `.ci/docker/build_docker.sh` now accepts the name in the following two formats:
* Short form, like `pytorch-linux-bionic-py3.11-clang9`
* Or long form, like `308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-bionic-py3.11-clang9`
This PR updates the build script to handle both cases.
This bug was discovered when I saw the wrong image name in https://github.com/pytorch/pytorch/actions/runs/5261424181/jobs/9509633110.
### Testing
Verify that the long form is handled correctly
```
export IMAGE_NAME=308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-py3.8-gcc7:06fdf1facf0eef5e5f303dd9cfac8639fb5f9201
export DOCKER_TAG=06fdf1facf0eef5e5f303dd9cfac8639fb5f9201
./build_docker.sh
+ tag=06fdf1facf0eef5e5f303dd9cfac8639fb5f9201
+ registry=308535385114.dkr.ecr.us-east-1.amazonaws.com
+ [[ 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-py3.8-gcc7:06fdf1facf0eef5e5f303dd9cfac8639fb5f9201 == *\3\0\8\5\3\5\3\8\5\1\1\4\.\d\k\r\.\e\c\r\.\u\s\-\e\a\s\t\-\1\.\a\m\a\z\o\n\a\w\s\.\c\o\m\/\p\y\t\o\r\c\h\/* ]]
++ echo pytorch-linux-focal-py3.8-gcc7:06fdf1facf0eef5e5f303dd9cfac8639fb5f9201
++ awk -F '[:,]' '{print $1}'
+ EXTRACTED_IMAGE_NAME=pytorch-linux-focal-py3.8-gcc7
+ IMAGE_NAME=pytorch-linux-focal-py3.8-gcc7
+ image=308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-py3.8-gcc7
+ [[ -z '' ]]
+ retry login 308535385114.dkr.ecr.us-east-1.amazonaws.com
+ login 308535385114.dkr.ecr.us-east-1.amazonaws.com
+ aws ecr get-authorization-token --region us-east-1 --output text --query 'authorizationData[].authorizationToken'
+ base64 -d
+ cut -d: -f2
+ docker login -u AWS --password-stdin 308535385114.dkr.ecr.us-east-1.amazonaws.com
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103562
Approved by: https://github.com/PaliC
## Description
Fix cpp wrapper for models which have constants in the graph inputs.
Python wrapper directly gets the value inside the wrapper call as a global variable passed when calling:
4081e924a8/torch/_inductor/codecache.py (L757)
The constants value has been saved in `mod.__dict__` in
4081e924a8/torch/_inductor/graph.py (L874-L875)
For cpp wrapper, we need to append constants to the input args, so as to pass this python value to the `inductor_entry_cpp` function explicitly.
### Example
Example of output code for dlrm in TorchBench with this fix:
```py
module = CppWrapperCodeCache.load(cpp_wrapper_src, 'inductor_entry_cpp', 'cfkc6c36t7cggi6mnokrdm5jhesnunjg5xysv3o3x3vaqmzmpe6r', False)
def _wrap_func(f):
def g(args):
args_tensor = [arg if isinstance(arg, torch.Tensor) else torch.tensor(arg) for arg in args]
constants_tensor = [constant0, constant1]
args_tensor.extend(constants_tensor)
return f(args_tensor)
return g
call = _wrap_func(module.inductor_entry_cpp)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103496
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/desertfire
Summary:
Dynamo tracing, via dynamo.export with aten_graph, generates a graph with nodes
whose target is an instance of torch._ops.OpOverload. The quantization workflow
inserts quantize/dequantize ops which are sometimes instances of
torch._ops.OpOverload (quantize_per_tensor.tensor) and other times instances
of torch._ops.OpOverloadPacket (quantize_per_tensor), which is a bit inconsistent.
It is also not clear whether a model is validly exported if it has nodes whose
target is of type torch._ops.OpOverloadPacket.
Without the op overload name attached to the 'target', it fails during executorch
tracing. The reason is that executorch tracing expects node targets to be
instances of torch._ops.OpOverload, not torch._ops.OpOverloadPacket.
So for consistency and tracing reasons, this fixes the convert pass to insert
ops that are torch._ops.OpOverload instances.
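For reference, the general distinction can be seen with any aten op (using `aten.add` here rather than the quantize/dequantize ops from this diff):
```python
import torch

packet = torch.ops.aten.add            # torch._ops.OpOverloadPacket: names the op
overload = torch.ops.aten.add.Tensor   # torch._ops.OpOverload: names one overload

print(type(packet).__name__)    # OpOverloadPacket
print(type(overload).__name__)  # OpOverload
```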
Test Plan: CI
Reviewed By: jerryzh168
Differential Revision: D46342822
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103251
Approved by: https://github.com/andrewor14
This commit changes ModelReportObserver variables to buffers, similar to other observers. This allows gathering data on devices other than CPU.
Moreover, it updates InputWeightEqualizationDetector to compute weight stats that are on GPU.
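A sketch of the attribute-to-buffer change with a hypothetical tiny observer (not the actual ModelReportObserver code); as buffers, the stats follow `.to(device)` and show up in the state dict:
```python
import torch
import torch.nn as nn

class TinyObserver(nn.Module):
    def __init__(self):
        super().__init__()
        # Previously plain attributes; as registered buffers they move with the module.
        self.register_buffer("num_batches", torch.tensor(0))
        self.register_buffer("running_min", torch.tensor(float("inf")))
        self.register_buffer("running_max", torch.tensor(float("-inf")))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        self.num_batches += 1
        self.running_min = torch.minimum(self.running_min, x.min())
        self.running_max = torch.maximum(self.running_max, x.max())
        return x
```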
Tested by running the tests in `test/quantization/fx/test_model_report_fx.py`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97971
Approved by: https://github.com/vkuzo
Summary:
Stack: https://pytorch.org/docs/stable/generated/torch.stack.html
This diff uses `at::unsqueeze` and `at::cat` to implement `at::stack` for all dims
Re-organize the tests to 1d, 2d, 3d tensors.
Test Plan:
```
lfq@lfq-mbp fbsource % buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 -- --gtest_filter="*stack*"
Restarting Buck daemon because Buck version has changed...
Buck daemon started.
Parsing buck files: finished in 9.1 sec
Creating action graph: finished in 0.7 sec
Downloaded 54/3888 artifacts, 27.68 Mbytes, 97.3% cache miss (for updated rules)
Building: finished in 07:36.5 min (100%) 2487/2487 jobs, 2487/2487 updated
Total time: 07:46.3 min
BUILD SUCCEEDED
Running main() from xplat/third-party/gmock/googletest-1.12.1/googletest/src/gtest_main.cc
Note: Google Test filter = *stack*
[==========] Running 4 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 4 tests from VulkanAPITest
[ RUN ] VulkanAPITest.stack_invalid_inputs
[ OK ] VulkanAPITest.stack_invalid_inputs (499 ms)
[ RUN ] VulkanAPITest.stack_1d
[ OK ] VulkanAPITest.stack_1d (6 ms)
[ RUN ] VulkanAPITest.stack_2d
[ OK ] VulkanAPITest.stack_2d (12 ms)
[ RUN ] VulkanAPITest.stack_3d
[ OK ] VulkanAPITest.stack_3d (130 ms)
[----------] 4 tests from VulkanAPITest (649 ms total)
[----------] Global test environment tear-down
[==========] 4 tests from 1 test suite ran. (649 ms total)
[ PASSED ] 4 tests.
lfq@lfq-mbp fbsource %
```
Reviewed By: yipjustin
Differential Revision: D46178424
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103344
Approved by: https://github.com/SS-JIA
These are the numbers with this PR (memory footprint figure omitted).
There are 3 main followups
* A naive partitioner gives better memory footprint than min-cut partitioner here. Currently, we are using min-cut partitioner. Waiting for @Chillee to discuss this further to either modify min-cut or add a naive partitioner.
* aot_eager is < 1x memory footprint. This is true even for non AC models. This could hide some inefficiency somewhere.
* inductor is giving very different memory numbers between AOT-traced-AC (duplicate early) vs this implementation. This leads to some inefficiency in inductor that we need to resolve.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102935
Approved by: https://github.com/jansel
Summary:
Dynamo burns in scalars instead of keeping them on the module. This results in
quantize_per_tensor and dequantize_per_tensor nodes having burnt-in scale and
zero-point values, if we trace them as scalars.
The graph rewrite ignores literals, and when the match pattern is replaced with
the replacement pattern, we lose the scale/zp and other values from the nodes in
the original graph and instead get the ones from the replacement graph.
This diff fixes that for the per-tensor q/dq nodes by manually copying these values
over.
Note that this is not robust because it works only when there is a single
q/dq node.
Test Plan: quantization_pt2e
Reviewed By: andrewor14
Differential Revision: D46614000
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103556
Approved by: https://github.com/andrewor14
Summary:
When we call an overload packet (e.g. torch.ops.aten.ge), there's some C++ code (from TorchScript) that determines which overload to use. There's sometimes ambiguity as to which op should be used. Therefore, for python we should use the specific overload name if we know it.
Specifically, the issue was with ge. We had a test (test_lerp_cuda from test_torchinductor.py) that eventually got lowered to code like this:
```
torch.ops.aten.ge(torch.tensor(70000.), 0.5)
```
This can either match torch.ops.aten.ge.Scalar (the intended overload), which will return torch.tensor(True); or it can match torch.ops.aten.ge.float (a TorchScript overload), which will return `True`. The decision of which to use depends on the order in which the operators are registered. Internally, depending on the build config (opt vs. dev-nosan), the operator registration order could differ. In opt mode, the torchscript overload would appear first and therefore would get called first, and cause the inductor program to fail.
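A small illustration of why pinning the overload matters; naming the overload explicitly fixes the result type regardless of how the packet resolution is ordered:
```python
import torch

t = torch.tensor(70000.0)

# The overload packet lets the dispatcher pick an overload for us.
packet_result = torch.ops.aten.ge(t, 0.5)

# Naming .Scalar pins the intended overload, which always returns a bool tensor.
explicit_result = torch.ops.aten.ge.Scalar(t, 0.5)

print(packet_result, explicit_result)
```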
Differential Revision: D46712744
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103576
Approved by: https://github.com/jgong5, https://github.com/desertfire
Fixes #103481
Normally triton tensors have shape `[XBLOCK, RBLOCK]`, or some variation where
the lengths are 1 but the number of dimensions is the same. The `no_x_dim`
change, in addition to removing the x dimension, also removed the r dimension
from certain values such as the results of reductions and the `xindex` variable.
This fixes those two cases to correctly produce tensors of shape `[1]`,
equivalent to the old shape `[XBLOCK, 1]` with the x-dimension dropped.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103527
Approved by: https://github.com/ngimel
Adds support for multiple forward passes before the backward call for
static_graph=True.
There are 2 changes:
1) Change the accounting of when to populate static-graph-related maps
from relying on forward iterations to relying on backward calls.
2) In DDP Python, don't rely on num_forward iterations == 1 to enqueue the
delay allreduce. Instead use a flag.
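A minimal single-process sketch (gloo backend, hypothetical model and shapes) of the usage pattern this change is meant to support:
```python
import os
import torch
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

net = torch.nn.Linear(8, 8)
model = torch.nn.parallel.DistributedDataParallel(net, static_graph=True)

# Multiple forward passes before a single backward call.
out1 = model(torch.randn(4, 8))
out2 = model(torch.randn(4, 8))
(out1.sum() + out2.sum()).backward()

dist.destroy_process_group()
```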
Differential Revision: [D46673736](https://our.internmc.facebook.com/intern/diff/D46673736/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103487
Approved by: https://github.com/awgu
This PR adds `operator_compile_check` (pls bikeshed name), a gradcheck-like
API to test if a custom operator is supported by torch.compile.
The API is scoped to check only that the interaction between the
operator and torch.compile works (e.g. it is not going to include
gradcheck). Concretely, it currently checks the following things:
- schema correctness
- make_fx traceable (static shapes)
- aot_autograd correctness (static shapes)
- torch.compile correctness, with and without inductor (static shapes)
- make_fx traceable (dynamic shapes)
- aot_autograd correctness (dynamic shapes)
- torch.compile correctness, with and without inductor (dynamic shapes)
Test Plan:
We test a bunch of error cases, including many failure modes that have tripped
us up in the past, and assert that they (mostly) have nice error messages:
- incorrect schema (mutates)
- incorrect schema (has a view)
- missing abstract impl
- incorrect abstract impl
- missing functionalization kernel
- autograd registered at CPU/CUDA keys
- operator is not traceable
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103198
Approved by: https://github.com/bdhirsh, https://github.com/soulitzer
This is in preparation for the new "custom_op_compile_check" utility,
which will call the refactored testing API as a subroutine.
Here are the improvements to the AOTAutograd tests that this PR makes:
- we use torch.autograd.grad instead of .backward(), which makes it so
that we stop destructively modifying the inputs
- we get rid of the difficult-to-understand sentinel=42 logic and
replace it with something more sane
- We create some helper functions and add some code comments
- We improve error messages
Test Plan:
- wait for CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103197
Approved by: https://github.com/bdhirsh, https://github.com/soulitzer, https://github.com/Chillee
This is in preparation for the custom_op_compile_check utility, which
will call the newly refactored function.
This PR:
- splits off code into helper functions
- adds clearer error messages
- stops updating the inputs destructively (leading to slightly slower
tests)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103196
Approved by: https://github.com/bdhirsh, https://github.com/soulitzer
The test was marked as flaky in #59965. However, it is not failing anymore so it can be enabled.
This PR enables only one test, but it will only run in local tests because the test suite is disabled in CI.
#94495 is a superset of this PR which enables the full test suite. The CI run there shows this test passing.
Fixes #59965
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103317
Approved by: https://github.com/kit1980
- Replace `for inst in instructions[0:target.offset//2]: inst.starts_line = None` with one that iterates over all instructions until the `inst.offset == target.offset` condition is met, making it uniform across Python bytecode dialects (Python 3.11+ bytecode size is variable, while bytecode size is fixed for older Pythons)
- Speed up the target_index search by replacing `[i for i in instructions if i.offset == offset][0]` with `next(i for i in instructions if i.offset == offset)`, which aborts the evaluation once the condition is met for the first time, according to:
```python
In [1]: lst=list(range(10000))
In [2]: %time [i for i in lst if i == 10]
CPU times: user 144 µs, sys: 23 µs, total: 167 µs
Wall time: 168 µs
Out[2]: [10]
In [3]: %time next(i for i in lst if i == 10)
CPU times: user 6 µs, sys: 0 ns, total: 6 µs
Wall time: 9.06 µs
Out[3]: 10
```
- Fix small typo
- use `is_py311_plus` variable rather than checking `sys.version_info`
Fixes https://github.com/pytorch/pytorch/issues/103355
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103525
Approved by: https://github.com/Skylion007, https://github.com/williamwen42
For the given test case from HF AllenaiLongformerBase, there is an accuracy issue in the dynamic shape case. The reason is that we are using int32 as the index type, but there is a default value ```9223372036854775807``` out of the range of int32; see the IR:
```
def masked_subblock1(self, ops):
get_index = self.get_index('index1')
index_expr = ops.index_expr(get_index, torch.int32)
get_index_1 = self.get_index('index2')
index_expr_1 = ops.index_expr(get_index_1, torch.int32)
ge = ops.ge(index_expr, index_expr_1)
get_index_2 = self.get_index('index1')
index_expr_2 = ops.index_expr(get_index_2, torch.int32)
constant = ops.constant(9223372036854775807, torch.int32)
lt = ops.lt(index_expr_2, constant)
and_ = ops.and_(ge, lt)
masked_subblock2 = self.masked_subblock2(and_, 0.0)
get_index_3 = self.get_index('index4')
load = ops.load('arg4_1', get_index_3)
where = ops.where(and_, masked_subblock2, load)
return where
```
and the CPU codegen will generate the cpp code according to the node type:
```
auto tmp3 = [&]
{
auto tmp4 = static_cast<int>(i3);
auto tmp5 = static_cast<int>(ks2);
auto tmp6 = tmp4 >= tmp5;
auto tmp7 = static_cast<int>(9223372036854775807);
auto tmp8 = tmp4 < tmp7;
auto tmp9 = tmp6 & tmp8;
auto tmp10 = [&]
{
auto tmp11 = in_ptr0[static_cast<long>(i2 + i3 + ((-1L)*ks2) + (i1*ks3) + (2L*i2*ks2) + (3L*i0*ks3) + (2L*i1*ks2*ks3) + ( 6L*i0*ks2*ks3))];
return tmp11;
}
;
auto tmp12 = tmp9 ? tmp10() : static_cast<decltype(tmp10())>(0.0);
auto tmp13 = in_ptr1[static_cast<long>(i2 + i3 + (i1*ks2) + (2L*i1*(static_cast<long>(ks2*ks2))) + (2L*i2*ks2) + (i0*ks1*ks2) + (2L*i0*ks1*(static_cast<long>(ks2*ks2))))];
auto tmp14 = tmp9 ? tmp12 : tmp13;
return tmp14;
}
```
For ```auto tmp7 = static_cast<int>(9223372036854775807);```, ```tmp7``` is always ```-1```, which is wrong.
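A tiny sketch of why the cast collapses to -1 on typical two's-complement hardware (plain Python mimicking the truncation, not Inductor code):
```python
INT64_MAX = 9223372036854775807  # the sentinel that appears in the IR above

def as_int32(x):
    # Keep the low 32 bits and reinterpret them as a signed 32-bit value,
    # which is what static_cast<int>(...) effectively does here.
    x &= 0xFFFFFFFF
    return x - 0x1_0000_0000 if x >= 0x8000_0000 else x

print(as_int32(INT64_MAX))  # -1, so the `tmp4 < tmp7` comparison is always false
```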
After This PR, HF AllenaiLongformerBase CPU dynamic shape path can be passed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103511
Approved by: https://github.com/desertfire
Enabling coordinate descent tuning for a few models causes illegal memory access (or triggers a device assert before that). Command:
```
TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 python benchmarks/dynamo/huggingface.py --amp --performance --training --inductor -d cuda --only CamemBert
```
It turns out that we can not benchmark this kernel: https://gist.github.com/shunting314/a78997f54b5751f2887f4576956036ce
Digging more, it turns out that this kernel has an inplace argument that is changed by running the kernel. Our benchmark API simply calls a kernel multiple times, and since each run may have side effects, earlier calls may change the inplace argument in a way that makes later calls fail.
This PR clones those inplace arguments before each benchmark call. This can increase the time of each benchmark call, but it should not affect autotuning since we add an equal amount of time to every tuning config.
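A hedged sketch of the idea (the function name, the explicit list of mutated-argument indices, and the timing loop are illustrative, not Inductor's actual benchmarking code):
```python
import torch

def bench_with_cloned_inplace_args(kernel, args, inplace_indices, n_repeat=5):
    timings = []
    for _ in range(n_repeat):
        # Clone the arguments the kernel mutates in place before every call, so
        # one run's side effects cannot invalidate the inputs of the next run.
        call_args = [a.clone() if i in inplace_indices else a for i, a in enumerate(args)]
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        kernel(*call_args)
        end.record()
        torch.cuda.synchronize()
        timings.append(start.elapsed_time(end))
    return min(timings)
```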
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103547
Approved by: https://github.com/jansel
Summary: aten::zero_: https://pytorch.org/docs/stable/generated/torch.Tensor.zero_.html
Test Plan:
clang-format on zero_.glsl and Zero.cpp
```
lfq@lfq-mbp fbsource % buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 -- --gtest_filter="*zero*"
Downloaded 0/48 artifacts, 0.00 bytes, 100.0% cache miss (for updated rules)
Building: finished in 40.5 sec (100%) 525/525 jobs, 12/525 updated
Total time: 40.5 sec
BUILD SUCCEEDED
Running main() from xplat/third-party/gmock/googletest-1.12.1/googletest/src/gtest_main.cc
Note: Google Test filter = *zero*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from VulkanAPITest
[ RUN ] VulkanAPITest.zero_
[ OK ] VulkanAPITest.zero_ (59 ms)
[----------] 1 test from VulkanAPITest (59 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (59 ms total)
[ PASSED ] 1 test.
```
Differential Revision: D46403983
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103042
Approved by: https://github.com/SS-JIA
First, infra improvements: new combinator `expectedFailureDynamic` which subsumes expectedFailure calls in test_dynamic_shapes.py. It's just nicer to have these right with the test. Implementation in torch/_dynamo/testing.py and it works by putting an attr on the test, which is then converted into a real expectedFailure when we actually generate the dynamic shapes test class
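A hedged sketch of that mechanism (the real generator also flips the dynamic-shapes config when it builds the class; the body here is illustrative):
```python
import unittest

def expectedFailureDynamic(fn):
    # Only tag the test; nothing changes for the static-shapes run.
    fn._expected_failure_dynamic = True
    return fn

def make_dynamic_cls(cls):
    # When generating the dynamic-shapes variant of a test class, turn the
    # tagged tests into real unittest.expectedFailure tests.
    overrides = {
        name: unittest.expectedFailure(value)
        for name, value in cls.__dict__.items()
        if callable(value) and getattr(value, "_expected_failure_dynamic", False)
    }
    return type(f"DynamicShapes{cls.__name__}", (cls,), overrides)
```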
Next, some housekeeping:
* test/dynamo/test_unspec.py accidentally was running mostly statically due to the `assume_static_by_default` config flip. Don't assume static by default and xfail some tests which regressed in that time.
* New test file test/dynamo/test_config.py, for testing permutations of configuration options. `test_dynamic_shapes` got moved there.
Finally, grinding through tests in a way that will make them more compatible with dynamic by default:
* If the test explicitly requires dynamic_shapes=False, remove that patch (and probably xfail it)
* If the test checks dynamic_shapes internally, remove that test and patch the test so it ALWAYS runs with dynamic_shapes (this is not coverage loss because we're going to switch the default)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103542
Approved by: https://github.com/anijain2305
Add the serialization logic for backend metadata to tensor serialization; it is implemented through custom registration functions.
In #97429, the structure backendMeta was provided in TensorImpl, and we think this part of the information may also need to be serialized for custom backends.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99808
Approved by: https://github.com/ezyang, https://github.com/huydhn
What this PR does is (continuation from #103435):
- Applying dynamic number of threads for innerdim scan with index function.
- Using dynamically allocated shared memory to get rid of `num_threads` template arguments.
@ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103502
Approved by: https://github.com/ngimel
Subgraphs are partitions cut out of a whole graph. Outputs of a subgraph are either global outputs of the original graph, or can be outputs of a partition that feed inputs of the subsequent partition. Subgraphs are created using the fx utility 'passes.split_module', which requires that each partition
have at least one output node.
In cases where DDPOptimizer asked the partitioner to cut the graph around a set of nodes which only
performed inplace mutation, the partitioner could be left trying to create a subgraph with no output nodes, violating its assumptions.
To circumvent this, DDPOptimizer can expand the set of nodes marked for inclusion in a subgraph that has no outputs until it includes a node that is an output for that subgraph. It still traverses nodes of the original graph in reverse order and only considers widening a subgraph by iterating further in reverse order than it would have ordinarily done (past the cut point dictated by parameter count). It may still be possible that the subgraph reaches the input node of the graph without satisfying the subgraph-output condition, in which case an error would still be raised by the partitioner.
Fixes #103385
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103488
Approved by: https://github.com/anijain2305
Summary:
att, we use module partition API to identify the GRU submodule and annotate all necessary patterns
Test Plan: buck2 test mode/opt caffe2/test:quantization_pt2e -- 'caffe2/test:quantization_pt2e'
Differential Revision: D46689428
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103526
Approved by: https://github.com/andrewor14
# torch.compiler public API
## Goal
The goal of this document is to describe the public facing API for torchdynamo and torchinductor.
Today both dynamo and torchinductor live in the `torch/_dynamo` and `torch/_inductor` namespaces, with the only public function,
`torch.compile()`, placed directly in `torch/__init__.py`.
This poses a few problems for users trying to take dependencies on PyTorch 2.0
1. Unclear BC guarantees
2. No builtin discovery mechanism outside of reading the source code
3. No hard requirements for docstrings or type annotations
Most importantly, it mixes two personas, the PyTorch 2.0 developer and the PyTorch 2.0 customer, so this is an attempt to address that. We draw a lot of inspiration from the `functorch` migration to the `func` namespace.
## Alternate names
We did discuss some other alternative names
1. `torch.compile` -> problem is this would break BC on the existing `torch.compile` function
2. `torch.dynamo` -> `dynamo` is so far not something we've deliberately hidden from users, but the problem is that figuring out what is `_dynamo` vs `dynamo` might be confusing
3. `torch.compiler` -> option 1 would be better, but to keep BC this is a good compromise
# The general approach
## Proposal 1
In https://github.com/pytorch/pytorch/blob/main/torch/_dynamo/__init__.py
We have function called `reset()`, this function is essential if users are trying to `torch.compile()` a model under different settings
```python
# in _dynamo/
def reset():
    do_reset_stuff()
```
Instead we propose
```python
# in compiler/
def reset():
    do_reset_stuff()  # As in copy paste the logic from _dynamo.reset

# in _dynamo/
import warnings
import inspect

def reset():
    function_name = inspect.currentframe().f_code.co_name
    warnings.warn(f"{function_name} is deprecated, use compiler.{function_name} instead", DeprecationWarning)
    return compiler.reset()
```
## Proposal 2
```python
# in compiler/
def reset():
    """
    Docstrings here
    """
    _dynamo.reset()

# in _dynamo/
# No changes
```
Consensus so far seems to be proposal 2 since fewer warnings will be less jarring and it’ll make it quite easy to merge the public API
## Docstrings
The above was an example of a function that has no inputs or outputs, but there are other functions which could use an improvement in their docstrings. For example, `allow_in_graph` actually works over lists of functions, but that's not mentioned anywhere in the example; you would only learn it by reading the source code.
```python
def allow_in_graph(fn):
    """
    Customize which functions TorchDynamo will include in the generated
    graph. Similar to `torch.fx.wrap()`.

    Parameters:
        fn (callable or list/tuple): The function(s) to be allowed in the graph.

    Returns:
        callable or list/tuple: The input function(s) included in the graph.

    Examples:
        Customize inclusion of a single function:
        ::
            torch._dynamo.allow_in_graph(my_custom_function)

        Customize inclusion of multiple functions:
        ::
            torch._dynamo.allow_in_graph([my_custom_function1, my_custom_function2])

        @torch._dynamo.optimize(...)
        def fn(a):
            x = torch.add(x, 1)
            x = my_custom_function(x)
            x = torch.add(x, 1)
            return x

        fn(...)

    Notes:
        The `allow_in_graph` function allows customization of which functions TorchDynamo
        includes in the generated graph. It can be used to include specific functions that
        are not automatically captured by TorchDynamo.

        If `fn` is a list or tuple, `allow_in_graph` will be called recursively on each
        element in the sequence.

        Once a function is allowed in the graph using `allow_in_graph`, it will be captured
        in the graph generated by TorchDynamo. This customization enables more fine-grained
        control over the functions included in the graph.

        Note that `allow_in_graph` expects the input `fn` to be a callable.
    """
    if isinstance(fn, (list, tuple)):
        return [allow_in_graph(x) for x in fn]
    assert callable(fn), "allow_in_graph expects a callable"
    allowed_functions._allowed_function_ids.add(id(fn))
    allowed_functions._disallowed_function_ids.remove(id(fn))
    return fn
```
So to make the API public, we’d have to write similar docstrings for all public functions we’d like to create.
The benefit of this approach is that
1. No BC risks, internal and external users relying on our tooling can slowly wean off the private functions.
2. We will also have to write correct docstrings which will automatically make our documentation easier to maintain and render correctly on pytorch.org
3. We already have some BC guarantees: we don't kill OptimizedModule, and we rejected the PR to change the config system
The con of this approach is that we will be stuck with some potentially suboptimal functions/classes that we can't kill
## Testing strategy
If the approach is mostly to make a public function call an already-tested private function, then all we need to do is ensure that the function signatures don't change; see the sketch below.
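One possible shape of such a test, assuming the proposed `torch.compiler` module exists and simply forwards to `torch._dynamo` (a sketch, not existing test code):
```python
import inspect
import torch._dynamo as _dynamo
import torch.compiler as compiler  # the proposed public module, assumed here

def test_public_signatures_match():
    # The public wrapper must keep the signature of the private function it wraps.
    assert inspect.signature(compiler.reset) == inspect.signature(_dynamo.reset)
```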
## Which functions should be in the public API
Our heuristic for deciding whether something should be public or not is: are users already relying on it for lack of other options, or have we recommended some non-public functions for users to debug their PT 2.0 programs?
The heuristic for not making something public is that it's an experimental subsystem with the goal of turning it on by default, it's very core-dev or Meta centric, it's a bunch of different configs that should be batched into a single user-facing one, or it's something that needs to be renamed because the name is confusing
#### Top level
`torch.compile()` -> already a public API; it does require some minor improvements, like having configs be passed in to any backend and not just inductor (EDIT: this was already done in https://github.com/pytorch/pytorch/pull/99645) and renaming `mode=reduce-overhead` to `mode=cudagraph`
To make sure that PT 2.0 is supported with a given PyTorch version, we can add a new public function; this would replace the `try/except` blocks around `import torch._dynamo` that have been populating user code.
```python
def pt2_enabled():
    if hasattr(torch, 'compile'):
        return True
    else:
        return False
```
For all of the below they will be translated to `torch.compiler.function_name()`
#### From _dynamo
As a starting point we looked at https://github.com/pytorch/pytorch/blob/main/torch/_dynamo/__init__.py and we suggest redefining these functions in `pytorch/torch/compiler/__init__.py`
It might also make sense to split them over multiple files and import them in `__init__.py` but because the number of functions is small it'd probably be fine to add them all into a single compiler/__init__.py until this list becomes larger
1. `reset()`
2. `allow_in_graph()`
3. `list_backends()`
4. `compile()`: torch.compile() would be mostly a shell function passing arguments to torch.compiler.compile()
5. `assume_constant_result()`: TODO: Double check how this is useful
6. `torch._dynamo.disable()`
Some notable omissions
1. `explain()`: We need to clean up the output for this function, make it a data class and pretty printable
2. `forbid_in_graph()`: Considered adding this but should instead consolidate on `disallow_in_graph`
3. `optimize_assert()`: Already covered by `torch.compile(fullgraph=True)`
4. `check_if_dynamo_supported()`: this would be supplanted by pt2_enabled()
5. `compilation_metrics`, `graph_breaks_reasons` ..: would all be accessed via `torch.compiler.explain()`
6. `replay`: does not seem useful to end customers
7. `graph_break()`: Mostly useful for debugging or unit tests
8. `register_backend()`: End users will just pass a string backend to torch.compile, only devs will create new backends
9. `export()`: Eventually this needs to be public, but for now it's not ready, so just highlighting that it will be in the public API eventually
10. `disallow_in_graph()`: Usage is limited
11. `mark_static()`: we can keep this private until dynamic=True is recommended in stable
12. `mark_dynamic()`: we can keep this private until dynamic=True is recommended in trunk
13. `OptimizedModule`: This is the only class that we'd expose but is crucial since users are running code like `if isinstance(mod, OptimizedModule): torch.save(mod._orig_mod)` EDIT: because we fixed pickling we no longer need to expose this
14. `is_compiling()`: Still not clear how this is useful to end users
There are also config variables which we need to expose https://github.com/pytorch/pytorch/blob/main/torch/_dynamo/config.py
Some of our configs are useful dev flags, others gate experimental functionality, and others are essential debugging tools; we separate out the essential debugging and logging tools into a public-facing config.
TODO: I still need to think of a good way of porting the config in a BC way here are some ideas
1. Just make all passes available and controllable via `torch.compile(options={})` but only show docstrings for the ones users should care about.
The current problem with our config system is that we have 3 ways of setting configs: via `options={}`, environment variables, and variables in `config.py`. It'd be worth settling on one source of truth and having that be the public API.
The configs we should make public are
1. `log_file_name`
2. `verbose`
3. `cache_size_limit`
4. `repro_level` and `repro_after`: Although we can rename these to minifier and give human readable names to the levels
Everything else should stay private in particular
1. `print_graph_breaks`, `print_specializations`: should be supplanted by `explain()` for public users
2. dynamic shape configs : Users should only have to worry about `torch.compile(dynamic=True/False)`
3. The distributed flags, hook or guard configs: If we tell a user to use FSDP and DDP then the flag should be enabled by default or be in a private namespace
4. The fbcode flags: Obviously no need to be user facing
5. Skip/Allow lists: Not something normal users should play around with
#### From _inductor
Very little of inductor should be exposed in a public-facing API. Our core audience, people writing models, mostly just needs information on what certain passes mean and how to control them at a high level, and they can do this with `torch.compile(options={})`, so the goal here should be more to make the available passes clearer and ideally consolidate them into `torch.compile()` docstrings or modes.
There are some exceptions though from https://github.com/pytorch/pytorch/blob/main/torch/_inductor/__init__.py
1. `list_mode_options()`
2. `list_options()`: this needs an additional pass to hide internal or debug options
For both of these we’d rename them to compiler.inductor_list_mode_options and compiler.inductor_list_options() since they would be in the same init file as the one for dynamo
Notable omissions
1. `_inductor.compile()`: Because users are coming in with their own fx graph, they are likely developers
2. `_inductor.aot_compile()`: Again, this is about capturing and modifying fx graphs, so these APIs don't need to be public
However the configs are a slightly different story, because we can choose to either
1. Make all configs public
2. Make some configs public and keep most of the private ones. If public config is set it should override the private version
3. Make all configs controllable via `torch.compile(options={})` but make list_options() hide more things
For now 3 seems like the most reasonable choice with some high level configs we’ll keep like TORCH_COMPILE_DEBUG
Regardless here's what should probably be public or advertised more
1. `disable_progress` and verbose_progress: Combine and enable by default
2. `fallback_random`: We could make the case this shouldn't be public if a top level deterministic mode enables this
3. `profile_bandwidth`: Or could make the case that this should be in TORCH_COMPILE_DEBUG
Notable omissions
1. Any config that would generally improve performance for most that we should probably enable by default but might be disabled in the short term because of stability: example `epilogue_fusion`, `pattern_matcher`, `reordering`
2. Autotuning flags: Should just sit behind `torch.compile(mode="max-autotune")` like `max_autotune`, `max_autotune_gemm`
3. `coordinate_descent_tuning`: This one I'm a bit mixed about; maybe it should also just fall into `mode="max-autotune"`
4. `trace`: `TORCH_COMPILE_DEBUG` is the best flag for all of this
5. `triton.cudagraphs`: Default should be `torch.compile(mode="reduce-overhead")` - I'd go further and rename the `mode=cudagraph` and we can keep reduce-overhead for BC reasons
6. `triton_unique_kernel_names`: Mostly useful for devs debugging
7. `dce`: which doesn't really do anything
8. `shape_padding`: Elias is working on enabling this by default in which case we also remove it
## Mechanics
This PR would include the public functions with their docstrings
Another PR will take a stab at the configs
And for work where the APIs are still being cleaned up, whether it's the minifier or escape hatches, export or dynamic shapes, aot_inductor, etc., we'll keep them private until a public commitment can be made
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102182
Approved by: https://github.com/jansel, https://github.com/albanD
There was an issue reported internally that with `sync_module_states=True`, if the model had buffers on CPU, even with `device_id` specified, FSDP would try to broadcast CPU buffers, leading to an error like:
```
RuntimeError: No backend type associated with device type cpu
```
After some investigation, I determined that we should _not_ fix this by moving the buffers to GPU just for the broadcast and then back to CPU. Instead, we should fix our `device_id` logic.
The issue is that we always used the _parameters_ as the proxy to tell whether we should move module states to the device specified by `device_id`. However, a module (often the root) may not have any parameters but have some buffers! In that case, the buffers are left on CPU even if `device_id` is specified. This PR fixes this by considering both parameters and buffers for movement to `device_id`.
Note that this PR preserves the logic that `ignored_modules` / `ignored_parameters` are not considered for this movement, meaning that ignored parameters are not moved to `device_id`.
Note also that I had to move the unit test back from using MTPG to the normal PG since otherwise, I could not repro the original error. (It seems like MTPG does not complain if we try to use `dist._broadcast_coalesced()` with CPU tensors.)
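A hedged sketch of the fixed check described above (not FSDP's actual code; the ignored-state bookkeeping is simplified):
```python
import torch

def module_states_to_move(module, ignored=()):
    # Consider both parameters and buffers when deciding what to move to
    # ``device_id``, while still skipping ignored states (tracked by identity).
    ignored_ids = {id(t) for t in ignored}
    states = list(module.parameters()) + list(module.buffers())
    return [t for t in states if id(t) not in ignored_ids]

m = torch.nn.BatchNorm1d(4)  # has parameters and buffers (running stats)
print(len(module_states_to_move(m)))  # 2 parameters + 3 buffers
```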
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103504
Approved by: https://github.com/rohan-varma
After https://github.com/pytorch/pytorch/pull/102107, rerunning disabled tests only collect and run disable tests. A side effect of this change is that the skip message `Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run` isn't in the test report anymore as these non-disabled tests are not going to be collected in the first place. This breaks the logic in the uploading script that depends on this string to know if the test report belongs to a rerunning disabled tests workflow.
* This PR updates the logic in `is_rerun_disabled_tests` check to count the number of times a test is run instead. In rerunning disabled tests mode, a test is run 50 times by default and 15 times for distributed tests (to avoid timeout). Both these numbers are larger than the max number of retries a test can get normally (3 x 3)
* This also removes the hacky `is_rerun_disabled_tests` check in `tools/stats/upload_test_stats.py` as rerun disabled tests reports are now very small (50 x the number of disabled tests)
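A hedged sketch of the updated check (the constants come straight from the description above; the real implementation reads the run counts out of the test reports):
```python
MAX_NORMAL_RETRIES = 3 * 3  # 3 reruns x 3 retries in a normal workflow

def looks_like_rerun_disabled_tests(num_runs_of_a_test: int) -> bool:
    # Rerun-disabled-tests mode runs each test 50 times (15 for distributed),
    # which is always more than any normal retry path can produce.
    return num_runs_of_a_test > MAX_NORMAL_RETRIES
```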
### Testing
* `test_gradgrad_nn_GroupNorm_cuda_float64` now shows up correctly https://github.com/pytorch/pytorch/issues/98678
```
python3 -m tools.stats.check_disabled_tests --workflow-run-id 5229037746 --workflow-run-attempt 1 --repo "pytorch/pytorch"
Using temporary directory: /var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpdojg5vq5
Downloading test-reports-test-default-1-4-linux.g5.4xlarge.nvidia.gpu_14154925022.zip
Downloading test-reports-test-default-1-4-linux.g5.4xlarge.nvidia.gpu_14154925093.zip
Downloading test-reports-test-default-2-4-linux.g5.4xlarge.nvidia.gpu_14154925167.zip
Downloading test-reports-test-default-2-4-linux.g5.4xlarge.nvidia.gpu_14154925226.zip
Downloading test-reports-test-default-3-4-linux.g5.4xlarge.nvidia.gpu_14154925295.zip
Downloading test-reports-test-default-3-4-linux.g5.4xlarge.nvidia.gpu_14154925371.zip
Downloading test-reports-test-default-4-4-linux.g5.4xlarge.nvidia.gpu_14154925453.zip
Downloading test-reports-test-default-4-4-linux.g5.4xlarge.nvidia.gpu_14154925536.zip
Downloading test-reports-test-slow-1-1-linux.2xlarge_14154853469.zip
Downloading test-reports-test-slow-1-1-linux.rocm.gpu_14154932523.zip
Downloading test-reports-test-slow-1-1-linux.rocm.gpu_14154932563.zip
Downloading test-reports-test-slow-1-2-linux.4xlarge_14154873704.zip
Downloading test-reports-test-slow-1-2-linux.g5.4xlarge.nvidia.gpu_14154931154.zip
Downloading test-reports-test-slow-1-2-linux.g5.4xlarge.nvidia.gpu_14154931186.zip
Downloading test-reports-test-slow-2-2-linux.4xlarge_14154873756.zip
Downloading test-reports-test-slow-2-2-linux.g5.4xlarge.nvidia.gpu_14154931225.zip
Downloading test-reports-test-slow-2-2-linux.g5.4xlarge.nvidia.gpu_14154931267.zip
Extracting test-reports-test-default-1-4-linux.g5.4xlarge.nvidia.gpu_14154925022.zip to unzipped-test-reports-test-default-1-4-linux.g5.4xlarge.nvidia.gpu_14154925022
Extracting test-reports-test-default-1-4-linux.g5.4xlarge.nvidia.gpu_14154925093.zip to unzipped-test-reports-test-default-1-4-linux.g5.4xlarge.nvidia.gpu_14154925093
Extracting test-reports-test-default-2-4-linux.g5.4xlarge.nvidia.gpu_14154925167.zip to unzipped-test-reports-test-default-2-4-linux.g5.4xlarge.nvidia.gpu_14154925167
Extracting test-reports-test-default-2-4-linux.g5.4xlarge.nvidia.gpu_14154925226.zip to unzipped-test-reports-test-default-2-4-linux.g5.4xlarge.nvidia.gpu_14154925226
Extracting test-reports-test-default-3-4-linux.g5.4xlarge.nvidia.gpu_14154925295.zip to unzipped-test-reports-test-default-3-4-linux.g5.4xlarge.nvidia.gpu_14154925295
Extracting test-reports-test-default-3-4-linux.g5.4xlarge.nvidia.gpu_14154925371.zip to unzipped-test-reports-test-default-3-4-linux.g5.4xlarge.nvidia.gpu_14154925371
Extracting test-reports-test-default-4-4-linux.g5.4xlarge.nvidia.gpu_14154925453.zip to unzipped-test-reports-test-default-4-4-linux.g5.4xlarge.nvidia.gpu_14154925453
Extracting test-reports-test-default-4-4-linux.g5.4xlarge.nvidia.gpu_14154925536.zip to unzipped-test-reports-test-default-4-4-linux.g5.4xlarge.nvidia.gpu_14154925536
Extracting test-reports-test-slow-1-1-linux.2xlarge_14154853469.zip to unzipped-test-reports-test-slow-1-1-linux.2xlarge_14154853469
Extracting test-reports-test-slow-1-1-linux.rocm.gpu_14154932523.zip to unzipped-test-reports-test-slow-1-1-linux.rocm.gpu_14154932523
Extracting test-reports-test-slow-1-1-linux.rocm.gpu_14154932563.zip to unzipped-test-reports-test-slow-1-1-linux.rocm.gpu_14154932563
Extracting test-reports-test-slow-1-2-linux.4xlarge_14154873704.zip to unzipped-test-reports-test-slow-1-2-linux.4xlarge_14154873704
Extracting test-reports-test-slow-1-2-linux.g5.4xlarge.nvidia.gpu_14154931154.zip to unzipped-test-reports-test-slow-1-2-linux.g5.4xlarge.nvidia.gpu_14154931154
Extracting test-reports-test-slow-1-2-linux.g5.4xlarge.nvidia.gpu_14154931186.zip to unzipped-test-reports-test-slow-1-2-linux.g5.4xlarge.nvidia.gpu_14154931186
Extracting test-reports-test-slow-2-2-linux.4xlarge_14154873756.zip to unzipped-test-reports-test-slow-2-2-linux.4xlarge_14154873756
Extracting test-reports-test-slow-2-2-linux.g5.4xlarge.nvidia.gpu_14154931225.zip to unzipped-test-reports-test-slow-2-2-linux.g5.4xlarge.nvidia.gpu_14154931225
Extracting test-reports-test-slow-2-2-linux.g5.4xlarge.nvidia.gpu_14154931267.zip to unzipped-test-reports-test-slow-2-2-linux.g5.4xlarge.nvidia.gpu_14154931267
Downloading test-reports-runattempt1-test-slow-1-1-linux.rocm.gpu_14154932523.zip
Downloading test-reports-runattempt1-test-slow-1-1-linux.rocm.gpu_14154932563.zip
Extracting test-reports-runattempt1-test-slow-1-1-linux.rocm.gpu_14154932523.zip to unzipped-test-reports-runattempt1-test-slow-1-1-linux.rocm.gpu_14154932523
Extracting test-reports-runattempt1-test-slow-1-1-linux.rocm.gpu_14154932563.zip to unzipped-test-reports-runattempt1-test-slow-1-1-linux.rocm.gpu_14154932563
The following 32 tests should be re-enabled:
test_huge_index (__main__.TestCuda) from test_cuda.py
test_conv_bn_fuse_cpu (__main__.CpuTests) from inductor/test_torchinductor.py
test_multi_threads (__main__.TestTorchrun) from backends/xeon/test_launch.py
test_huge_index (__main__.TestCuda) from test_cuda_expandable_segments.py
test_memory_timeline_no_id (__main__.TestMemoryProfilerE2E) from profiler/test_memory_profiler.py
test_inverse_errors_large_cuda_float64 (__main__.TestLinalgCUDA) from test_linalg.py
test_trace_dependencies (__main__.TestAnalyze) from test_package.py
test_caching_pinned_memory (__main__.TestCuda) from test_cuda_expandable_segments.py
test_graph_concurrent_replay (__main__.TestCuda) from test_cuda_expandable_segments.py
test_module_attribute_mutation_violation_negative_1 (__main__.MutationExportTests) from dynamo/test_export_mutations.py
test_module_attribute_mutation_violation_negative_2 (__main__.MutationExportTests) from dynamo/test_export_mutations.py
test_module_attribute_mutation_violation_negative_4 (__main__.MutationExportTests) from dynamo/test_export_mutations.py
test_vmapjvpall_linalg_lu_cuda_float32 (__main__.TestOperatorsCUDA) from functorch/test_ops.py
test_vmapjvpvjp_linalg_lu_cuda_float32 (__main__.TestOperatorsCUDA) from functorch/test_ops.py
test_Conv2d_no_bias_cuda_tf32 (__main__.TestNN) from test_nn.py
test_save_graph_repro (__main__.TestAfterAot) from dynamo/test_after_aot.py
test_doc_examples (__main__.TestTypeHints) from test_type_hints.py
test_caching_pinned_memory (__main__.TestCuda) from test_cuda.py
test_graph_concurrent_replay (__main__.TestCuda) from test_cuda.py
test_non_contiguous_tensors_nn_ConvTranspose1d_cuda_complex32 (__main__.TestModuleCUDA) from test_modules.py
test_pickle_nn_RNN_eval_mode_cuda_float64 (__main__.TestModuleCUDA) from test_modules.py
test_op_has_batch_rule_nn_functional_conv_transpose3d_cuda_float32 (__main__.TestVmapOperatorsOpInfoCUDA) from functorch/test_vmap.py
test_geometric_kstest_cuda_float32 (__main__.TestTorchDeviceTypeCUDA) from test_torch.py
test_profiler_experimental_tree_with_memory (__main__.TestProfilerTree) from profiler/test_profiler_tree.py
test_fs_pool (__main__.TestMultiprocessing) from test_multiprocessing.py
test_forward_mode_AD_linalg_lu_factor_ex_cuda_complex128 (__main__.TestFwdGradientsCUDA) from test_ops_fwd_gradients.py
test_vjp_linalg_lu_cuda_float32 (__main__.TestOperatorsCUDA) from functorch/test_ops.py
test_inplace_grad_fmod_cuda_float64 (__main__.TestBwdGradientsCUDA) from test_ops_gradients.py
test_inplace_gradgrad_remainder_cuda_float64 (__main__.TestBwdGradientsCUDA) from test_ops_gradients.py
test_bottleneck_cuda (__main__.TestBottleneck) from test_utils.py
test_comprehensive_empty_strided_cuda_int32 (__main__.TestInductorOpInfoCUDA) from inductor/test_torchinductor_opinfo.py
test_vmapvjpvjp_linalg_lu_cuda_float32 (__main__.TestOperatorsCUDA) from functorch/test_ops.py
The following 11 are still flaky:
test_transpose_with_norm (__main__.CPUReproTests) from inductor/test_cpu_repro.py, failing 215/215
test_compare_cpu_linalg_pinv_singular_cuda_float32 (__main__.TestCommonCUDA) from test_ops.py, failing 100/100
test_conv_bn_fuse_dynamic_shapes_cpu (__main__.DynamicShapesCodegenCpuTests) from inductor/test_torchinductor_codegen_dynamic_shapes.py, failing 115/115
test_lobpcg (__main__.TestAutograd) from test_autograd.py, failing 50/50
test_module_attribute_mutation_violation_negative_3 (__main__.MutationExportTests) from dynamo/test_export_mutations.py, failing 2/50
test_Conv2d_dilated_cuda_tf32 (__main__.TestNN) from test_nn.py, failing 1/50
test_grad_nn_GroupNorm_cuda_float64 (__main__.TestModuleCUDA) from test_modules.py, failing 50/50
test_index_add_correctness (__main__.TestTorch) from test_torch.py, failing 22/50
test_attn_cuda (__main__.TestMin) from functorch/test_dims.py, failing 1/50
test_open_device_registration (__main__.TestCppExtensionOpenRgistration) from test_cpp_extensions_open_device_registration.py, failing 50/50
test_gradgrad_nn_GroupNorm_cuda_float64 (__main__.TestModuleCUDA) from test_modules.py, failing 50/50
```
* Uploading tests stats for rerunning disabled tests takes only half a minute
```
time python3 -m tools.stats.upload_test_stats --workflow-run-id 5229037746 --workflow-run-attempt 1 --head-branch main
31.94s user 2.94s system 44% cpu 1:19.07 total
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103476
Approved by: https://github.com/clee2000
Summary:
When simplifying split cat patterns, if the next user of a split node was an output node, there was a bug leading to an issue like: P765993221
Basically, the bug was in how args and kwargs of the user were getting replaced, and the code didn't handle nested arg/kwargs.
Using torch.fx.Node functions such as `all_input_nodes` and `replace_input_with` fixes this issue
Differential Revision: D46603618
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103338
Approved by: https://github.com/jansel
Potential null dereference after dynamic cast was found during static analysis.
**Description:**
Dereference of `ctx` is performed in `TORCH_CHECK` on line 1176, while `ctx` pointer may equal `nullptr`.
Previous `TORCH_CHECK` on line 1175 checks the value of `ctx_ptr` pointer that may be of type that cannot be casted to `TestContext*`. In such case, `dynamic_cast` returns `nullptr` despite `ctx_ptr` is not equal to `nullptr`.
**Fix:**
- Check `ctx` instead of `ctx_ptr` for equality to zero.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97768
Approved by: https://github.com/kit1980
This is the continuation of optimizing inner-dimension scan operations (`torch.cumsum`, `torch.cumprod`, `torch.logcumsumexp`) by dynamically setting the number of threads based on the input shape from #103314.
What I found is that just setting the number of x-threads and y-threads following the ratio of the tensor's shape works quite well (with some clamping).
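A hedged sketch of that heuristic (the real kernel launcher's exact clamping and thread budget differ; the numbers here are illustrative):
```python
import math

def pick_block_dims(batch_size, row_length, max_threads=512):
    # Split a fixed thread budget between the x (scan) and y (batch) dimensions
    # following the shape ratio, rounded to a power of two and clamped.
    ratio = row_length / max(batch_size, 1)
    num_threads_x = 2 ** round(math.log2(max(ratio, 1)))
    num_threads_x = max(2, min(num_threads_x, max_threads))
    num_threads_y = max(1, max_threads // num_threads_x)
    return num_threads_x, num_threads_y

print(pick_block_dims(8, 4096))   # long rows -> mostly x-threads
print(pick_block_dims(4096, 8))   # many short rows -> mostly y-threads
```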
Here is the speed-up of this PR, compared to `2.0.0+cu118` (not compared to #103314) using A100 with 40GB memory (up to 23x faster):
```
2 8 32 128 512 1024 2048 4096 8096 16348 65536 262144 1048576
2: 1.07(4) 1.02(5) 1.01(6) 1.07(7) 2.16(8) 4.94(9) 8.71(9) 11.00(9) 12.99(9) 14.77(9) 16.41(9) 16.81(9) 16.97(9)
8: 1.20(4) 1.00(4) 1.01(5) 1.08(6) 2.85(7) 4.90(8) 6.34(8) 11.76(9) 13.86(9) 15.26(9) 16.96(9) 17.45(9) 19.75(9)
32: 1.08(4) 1.00(4) 1.00(4) 1.23(5) 2.48(6) 4.23(7) 5.04(7) 9.16(8) 10.11(8) 18.72(9) 20.64(9) 23.13(9) 23.50(9)
128: 1.09(4) 1.02(4) 1.03(4) 1.02(4) 1.64(5) 2.84(6) 3.08(6) 5.61(7) 5.86(7) 10.72(8) 19.22(9) 19.75(9) 19.97(9)
512: 1.06(4) 1.14(4) 1.01(4) 1.10(4) 1.02(4) 1.78(5) 1.85(5) 3.26(6) 3.34(6) 5.56(7) 8.56(8) 9.55(9) 9.62(9)
1024: 1.21(4) 1.22(4) 1.20(4) 1.06(4) 1.03(4) 1.05(4) 1.81(5) 1.86(5) 3.06(6) 3.12(6) 4.76(7) 5.20(8) 5.56(9)
2048: 1.04(4) 0.88(4) 1.00(4) 1.01(4) 1.02(4) 1.03(4) 1.02(4) 1.72(5) 1.73(5) 2.62(6) 2.86(7) 3.06(8) --------
4096: 1.02(4) 1.12(4) 0.98(4) 1.60(4) 1.16(4) 1.09(4) 1.10(4) 1.10(4) 1.74(5) 1.75(5) 1.86(6) 2.00(7) --------
8096: 1.03(4) 1.00(4) 1.00(4) 1.16(4) 1.17(4) 1.17(4) 1.18(4) 1.18(4) 1.18(4) 1.27(5) 1.43(6) -------- --------
16348: 1.02(4) 1.15(4) 1.11(4) 1.17(4) 1.12(4) 1.11(4) 1.13(4) 1.12(4) 1.11(4) 1.08(4) 1.32(5) -------- --------
65536: 1.17(4) 1.17(4) 1.16(4) 1.15(4) 1.12(4) 1.12(4) 1.12(4) 1.10(4) 1.10(4) 1.07(4) -------- -------- --------
262144: 1.20(4) 1.20(4) 1.08(4) 1.13(4) 1.10(4) 1.09(4) 1.10(4) 1.08(4) -------- -------- -------- -------- --------
1048576: 1.21(4) 1.14(4) 1.10(4) 1.13(4) 1.09(4) 1.08(4) -------- -------- -------- -------- -------- -------- --------
```
The first row is the innermost dimension, the first column is the outermost dimension (i.e. the batch size).
The float numbers are the speed up while the integers within the brackets are the log2 of number of x-threads.
The blank cells (the ones with dashes) are not compared because of my GPU's memory limitation.
There are some slowdowns that I observed (like `(2048, 8)` and `(4096, 32)`). The slowdown is because in this PR, the scan loop (the one I use with Sklansky) is not optimized by the compiler due to dynamic number of iterations (it is `log2(num_threads_x)`), while in the previous version, the scan loop can be unrolled and optimized by the compiler due to fixed number of iterations.
That's why I slightly modified the operations within the scan loop to use bit operations in order to compensate for this slowdown.
The most significant acceleration comes from the tensors with relatively small batch size (<= 4096) and with very long sequence.
As the batch size increases, the speed up is not that significant because the previous implementation is most likely to be optimized.
NOTE: I haven't optimized scan dim with indices, it could come in another PR.
As for the build time, I tried not to write more templated functions than necessary.
I will report the build time when I already have the numbers.
UPDATE: I compared the build time when I changed ScanUtils.cuh only. In `main` branch, it took 4m2s, while in this PR, it took 3m39s.
What do you think, @ngimel?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103435
Approved by: https://github.com/ngimel
Summary: att, we use module partition API to identify the GRU submodule and annotate all necessary patterns
Test Plan: buck2 test mode/opt caffe2/test:quantization_pt2e -- 'caffe2/test:quantization_pt2e'
Reviewed By: kimishpatel
Differential Revision: D46384329
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103358
Approved by: https://github.com/HDCharles
This PR decouples the logic necessary to compute bounds on variables
from the logic that uses this info to perform the strength analysis on
int64 variables. While doing so, it tries to minimize the number of
attributes of the class in favour of local variables.
This class is now accessible from any `LoopBody` object.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100549
Approved by: https://github.com/eellison
numpy_pytorch_interop is required to be installed for all tests annotated with the `@requires_numpy_pytorch_interop` decorator.
This PR adds a commit for it and adds a function to install it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103447
Approved by: https://github.com/ezyang
Added a feature to upload test statistics to DynamoDB and Rockset using a new function `emit_metric` in `tools/stats/upload_stats_lib.py`.
Added metrics to measure test reordering effectiveness in `tools/testing/test_selections.py`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102691
Approved by: https://github.com/malfet
Introduces two higher order operators
* run_and_save_rng_state - Saves the current rng state and then runs the op.
* run_with_rng_state - Runs the op with the rng state supplied as an input
Ideally, we would like to support these operators through torch.compile. But currently the plan is to introduce them at the partitioner level, obviating the need to support them fully through the torch.compile stack. To ensure that we have good enough debugging with minifiers, we have ensured that they work with make_fx. In the future, we can move them onto torch.compile.
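A conceptual sketch of the two behaviors in plain Python (CPU generator only; the actual higher-order ops are graph nodes and also cover CUDA state):
```python
import torch

def run_and_save_rng_state(op, *args):
    # Capture the current RNG state, then run the op.
    state = torch.get_rng_state()
    return state, op(*args)

def run_with_rng_state(rng_state, op, *args):
    # Restore the saved state so the op sees the same random numbers as the
    # original call, then run it.
    torch.set_rng_state(rng_state)
    return op(*args)

state, a = run_and_save_rng_state(torch.rand, 4)
b = run_with_rng_state(state, torch.rand, 4)
assert torch.equal(a, b)
```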
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102934
Approved by: https://github.com/jansel, https://github.com/zou3519
Adds a freezing pass, gated by inductor's `config.freezing`, that will constant-fold parameters. This occurs post-functionalization in AOT Autograd, both to capture dispatching and to allow passes to run post-functionalization. A few notes:
- There is an option to discard parameters `config.freezing_discard_parameters` which will take the current eager modules and wrap parameters to a Tensor subclass which will error if used.
- I needed to expose flat_params in aot_autograd in order to discard old references when we constant fold away parameters, like with amp. I also exposed `fw_metadata` to avoid constant folding mutated parameters.
- Caching parameter transformations/constant folding across different inferences nyi
- Checking version_counter of constant folded params nyi
I'm not really sure what the actual naming should be. In jit there was both "freezing", which was platform agnostic, and "optimize for inference", which made device specific optimizations. We're doing the latter here but maybe freezing is a better name.
Differential Revision: [D46244033](https://our.internmc.facebook.com/intern/diff/D46244033)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100652
Approved by: https://github.com/jansel
**Motivation:**
For collective dispatching, we want to provide a more user friendly usage for xpu device and CCL backend (user specified backend) mapping.
**Solution:**
We add xpu to the default device list, and it can construct the mapping between xpu and the user specified backend directly.
Usage:
When using xpu device, user can specify backend name only:
`dist.init_process_group(backend='ccl')`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103410
Approved by: https://github.com/jgong5, https://github.com/ezyang
The main concept behind this refactor is this: if we know that a size/stride/etc is constant, do NOT trace it into the graph, EXCEPT for any preexisting special cases that applied for static shapes. The refactor unfolds like this:
1. Delete the `dynamic_shapes` branches in torch/_dynamo/variables/builder.py which accept int/float/bool outputs. This is over-aggressive and we don't want to allow this (because if the operator returns a constant, we shouldn't have called wrap_fx_proxy in the first place.) This causes a bunch of failures because we are blindly feeding the result of size() call to wrap_fx_proxy when dynamic shapes is enabled.
2. Modify TensorVariable.call_method in torch/_dynamo/variables/tensor.py to avoid sending constant ints to wrap_fx_proxy. After normal specialization (which should be deleted, see https://github.com/pytorch/pytorch/pull/103434) we consult the fake tensor to see if the values in question have free variables or not. If they don't we short circuit tracing into graph. We only trace into graph if the operation in question is truly symbolic. Note that there is a near miss here: it's OK to trace x.size() call entirely into the graph, even if it doesn't have all dynamic shapes, because operator.getitem with int output is special cased in builder.py. This is a preexisting special case and I don't try to get rid of it.
3. It turns out that the change here also breaks torch_np compatibility layer. So I completely rewrite getattr handling in torch/_dynamo/variables/tensor.py to follow the same pattern (only trace into graph if truly dynamic).
There's some minor housekeeping in torch/fx/experimental/symbolic_shapes.py and some test files.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103438
Approved by: https://github.com/larryliu0820
Fixes #101328
Note that this most likely is a bandage solution. We either need to actually fix one of those onnx passes that is causing this decomposition/functionalization issue, or need to special case all onnx op in `runTorchBackendForOnnx` like this one.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101329
Approved by: https://github.com/BowenBao
This PR adds universal support for ndarray methods. After #100839 each `NumpyNdarrayVariable` should wrap a `torch.Tensor`. This PR adds a `numpy_method_wrapper` which converts the `torch.Tensor` to `torch_np.ndarray` and then call the numpy ndarray method. Then we also try to return a `torch.Tensor` (return as-is if the value is not ndarray-like)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97537
Approved by: https://github.com/ezyang
Fixes #99569
nn.Parameter construction appears to run into FakeTensor / tracing issues during AOT Autograd. We could try to fix this; but nn.Parameter construction _inside_ the compiled region isn't a common scenario, so it's reasonable to just graph break on nn.Parameter construction.
For reference, see #99569 for the errors/issues that appear from tracing through nn.Parameter construction with AOT Autograd.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103262
Approved by: https://github.com/williamwen42
Originally, my goal for this PR was to remove the `dynamic_shapes` tests in torch/_dynamo/variables/builder.py. However, one thing lead to another, and it turns out that it was easiest to do all of the following in one go:
* Unconditionally allocate a ShapeEnv, no matter if dynamic_shapes is enabled or not (torch/_dynamo/output_graph.py). There is a small adjustment to export torch/_dynamo/eval_frame.py to account for the fact that a ShapeEnv always exists, even if you're not doing symbolic export.
* Remove dynamic_shapes test from unspec logic (torch/_dynamo/variables/builder.py), the original goal
* Specialize strides and storage offset if all sizes are dynamic (torch/fx/experimental/symbolic_shapes.py). This is required to deal with unconditional ShapeEnv: if a ShapeEnv exist, fake tensor-ification may choose to allocate symbols. The idea is that with `automatic_dynamic_shapes == False`, Dynamo should never request dynamic sizes, but this invariant was not upheld for nontrivial strides/offset.
The rest are just auxiliary fixups from the above:
* Workaround bug in FakeTensorProp where sometimes it doesn't return a FakeTensor (torch/fx/passes/fake_tensor_prop.py), see https://github.com/pytorch/pytorch/pull/103395 for follow up
* Make ShapeProp correctly handle int inputs (torch/fx/passes/shape_prop.py)
* Disable indexing strength reduction if `assume_static_by_default` is False (torch/_inductor/codegen/triton.py)
* Fix hf_T5_generate to NOT toggle `assume_static_by_default` if dynamic shapes is not enabled (benchmarks/dynamo/common.py); technically this is not necessary anymore but it's in for safety.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103302
Approved by: https://github.com/voznesenskym
I found that this algorithm (Sklansky) could provide a speed-up over the previously implemented Brent-Kung (BK) algorithm. In the BK algorithm, the sweeps are done twice: an up-sweep and a down-sweep. In the up-sweep, initially all threads are working, but then half of the working threads become inactive at each subsequent step. The down-sweep is similar but the other way around: it starts with only 1 working thread and doubles the number of working threads at each step. This leaves half of the threads idle on average and produces `2 * log2(num_threads_x)` sweep steps.
On the other hand, the Sklansky algorithm uses only 1 sweep, and in each step of the sweep all the threads are working. This algorithm produces `log2(num_threads_x)` sweep steps, which is half of the BK algorithm. That provides the speed-up. I follow the schematics of the Sklansky algorithm provided in [this paper](https://research.nvidia.com/sites/default/files/pubs/2016-03_Single-pass-Parallel-Prefix/nvr-2016-002.pdf). The same paper provides a much better algorithm (the one implemented in CUB), but I haven't got my head around it, while the Sklansky algorithm is easier to digest and implement.
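A small CPU sketch of the Sklansky combining pattern (plain Python, not the CUDA kernel; the indexing follows the schematic described above):
```python
import math

def sklansky_inclusive_scan(values, op=lambda a, b: a + b):
    # At step d, every index with bit d set combines with the last index of the
    # preceding 2**d block, so all lanes do useful work at every step and only
    # log2(n) steps are needed.
    out = list(values)
    n = len(out)
    steps = math.ceil(math.log2(n)) if n > 1 else 0
    for d in range(steps):
        block = 1 << d
        for i in range(n):
            if i & block:
                partner = (i | (block - 1)) - block  # last index of the previous block
                out[i] = op(out[partner], out[i])
    return out

print(sklansky_inclusive_scan([1, 2, 3, 4, 5]))  # [1, 3, 6, 10, 15]
```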
Here are the speed up from my experiment using `cumsum` in the innermost dimension using A100:
(UPDATE: the newest commit further optimize it up to 76% on `8 x 4000` matrix)
(UPDATE: added shapes with 2048 and 1M in its elements)
| Shape | Torch cumsum | Custom cumsum | Speed up |
|--------------|---------------------------|--------------------------|---------------------|
| (2, 1000) | 4.8112869262695315e-05 | 2.849102020263672e-05 | 1.688702928870293 |
| (8, 4000) | 0.00017731189727783204 | 0.0001005411148071289 | 1.7635760018970834 |
| (128, 10000) | 0.0005342483520507813 | 0.00035474300384521487 | 1.5060151891928222 |
| (1024, 20000)| 0.0014238595962524415 | 0.0010990619659423829 | 1.2955225823246128 |
| (1024, 100000)| 0.007089591026306153 | 0.005468320846557617 | 1.296484099093993 |
| (2048, 1000000)| 0.058730244636535645 | 0.0458010196685791 | 1.2822912035913994 |
| (1000, 2) | 1.0919570922851562e-05 | 8.106231689453125e-06 | 1.3470588235294116 |
| (4000, 8) | 9.512901306152343e-06 | 7.867813110351562e-06 | 1.209090909090909 |
| (10000, 128) | 2.079010009765625e-05 | 1.6164779663085937e-05 | 1.2861356932153394 |
| (20000, 1024)| 0.00024993419647216796 | 0.00017964839935302734 | 1.3912408759124086 |
| (100000, 1024)| 0.0011160612106323243 | 0.0009322404861450195 | 1.1971816577581138 |
| (1000000, 2048) | 0.017030668258666993 | 0.014445066452026367 | 1.178995494082889 |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103314
Approved by: https://github.com/ngimel
We previously compared the FakeTensor's strides with the real tensor's strides. This caused dynamic dimensions of the FakeTensor to be specialized to static ints, which may cause a graph specialized for one shape to be used for another shape, which is wrong.
Use stride hints for the comparison instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103342
Approved by: https://github.com/malfet
Summary: Add msg to assertEqual field in the flaky test of test_memory_timeline_no_id, so that we print the actual tuple for debugging.
Test Plan: CI
Differential Revision: D46596242
Pulled By: aaronenyeshi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103326
Approved by: https://github.com/davidberard98
The changes in this PR include:
- Support ConvTranspose in cpp wrapper
- Fix cpp wrapper support for aten convolution when bias is `not None`: bias is in `args` instead of `kwargs` when it is `not None`. The change is covered by ConvTranspose dynamic shapes UT since we'll fall back to aten convolution in dynamic shape cases.
- Fix cpp wrapper support for `inf`. This is a UT added in https://github.com/pytorch/pytorch/issues/101865. The cpp wrapper UT is covered in `test_conv2d_unary` of `test_cpp_wrapper.py`. It's in `slowTest` category and seems not captured in the CI of that PR.
I will submit another PR to remove the hard-coded schema in these `ExternKernel`s.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103308
Approved by: https://github.com/jgong5, https://github.com/desertfire
Not sure why it was excluded previously (an oversight, I guess).
Also, please note that `clang++` is already considered an acceptable compiler (as it ends with `g++` ;))
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103349
Approved by: https://github.com/seemethere
`-force_load` is not a compiler option but a linker option, and as such should depend on the platform (i.e. macOS/iOS) rather than on the compiler (i.e. clang vs gcc)
Otherwise, attempt to link libtorch static with clang results in a cryptic `/usr/bin/ld: -f may not be used without -shared` error on Linux.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103348
Approved by: https://github.com/seemethere
We do not raise constraint violations for complex binary conditions, such as conditions involving `%`. Moreover, while these constraints are discovered by our solver, the solver does not inject new constraint violations. This can result in cases where export passes, appropriate assertions are not added, and we get runtime crashes.
Now, when the solver discovers constraints that are too complex, we force-specialize the involved dimensions and raise a constraint violation when such dimensions are marked dynamic. This forces the user to remove the dynamic marking, and causes the appropriate specialization assertions to be added.
Differential Revision: [D46415786](https://our.internmc.facebook.com/intern/diff/D46415786/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102897
Approved by: https://github.com/tugsbayasgalan
Fixes #102752
These 3 fallback kernels appear in GoogleFnet because they take complex arguments - i.e., usually they aren't fallback kernels. To support this model, we added support for these 3 ops.
Details:
1. Add these 3 ops to the allowlist. I assume that we eventually want to support all fallback kernels, but for now we just add these 3 ops to the allowlist.
2. Support complex64 in cpp codegen
3. Support List[] arguments and ScalarType arguments in cpp codegen
4. Allow alias_info in schema arguments. In the original PR supporting fallback kernels for cpp wrapper, ops with schemas with non-null alias_info for any of the arguments were disallowed; but I don't think there's any reason we need to disallow these in cpp wrapper code.
Caveats:
* This has not added support for complex32 or complex128
* It only works with static shapes, not dynamic shapes. It seems like the dynamic shapes issue is unrelated to cpp wrapper, since it fails in the test_torchinductor_dynamic_shapes.py test. I checked these `test_fft_.*` tests, which I added in this PR, and verified that they were broken with dynamic shapes before any of the code changes from this PR.
**Test**:
```
benchmarks/dynamo/huggingface.py --inductor --amp --accuracy --inference --device cuda --cpp-wrapper --only GoogleFnet
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103183
Approved by: https://github.com/desertfire, https://github.com/jgong5, https://github.com/chunyuan-w
Fixes: #101979
This PR adds support for dictionaries with torch object as keys in dynamo.
The main problem was that, for example, the source built for `d[torch.float]` (`d` being a
dictionary) was `ODictGetItemSource(GlobalSource('d'), index=torch.float)`. When
`Source.name` method was called, we got `odict_getitem(G['d'], torch.float)`. Evaluating
that string raised an error, since `torch` was only available in the global dictionary `G`
as `G["torch"]`.
Instead, this PR builds the source:
`ODictGetItemSource(GlobalSource('d'), index=AttrSource(GlobalSource('torch'), 'float'))`.
The to-be-evaluated string is correctly generated as:
`odict_getitem(G['d'], G['torch'].float)`.
Here's a minimal example that reproduces the error, before this PR:
```python
import torch
d = {
    torch.float16: torch.float32,
}

@torch.compile
def f():
    return torch.randn(3, dtype=d[torch.float16])

f()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103158
Approved by: https://github.com/mlazos
Summary:
In torch.distributed, we make ProcessGroupNCCL not call workEnqueue when the cuda stream is capturing. I.e., when capturing a CUDA graph, we do not enqueue anything for the watchdog thread to consider. This allows capturing NCCL operations in a CUDA Graph.
This is followup to an internal discussion [1] where the watchdog thread was observed to crash when using cuda graphs containing an all_reduce. The watchdog thread wants to query events pertaining to enqueued work items, but this can't be done for "events" created during cuda graph capture.
[1] https://fb.workplace.com/groups/1405155842844877/posts/6975201909173548/
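As a rough illustration (not the unit test added here, and omitting the warm-up that real-world graph capture typically needs), the kind of capture this change enables looks like:
```python
import torch
import torch.distributed as dist

# assumes a NCCL process group is already initialized (e.g. via torchrun)
# and the current CUDA device is set for this rank
x = torch.ones(1024, device="cuda")

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    # the stream is capturing here, so ProcessGroupNCCL skips workEnqueue and
    # the watchdog never sees events it cannot query
    dist.all_reduce(x)

g.replay()  # replays the captured all_reduce
```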
Test Plan: Test added. Also, the repro mentioned in https://fb.workplace.com/groups/1405155842844877/posts/7003002339726838/ runs successfully after this change.
Differential Revision: D46274814
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102542
Approved by: https://github.com/kwen2501
Previously, cudagraphs and dynamic_shapes were incompatible and enabling
dynamic shapes would forcibly disable cudagraphs. I think this new strategy
is better. The idea is essentially that cudagraphs is an
"optimization" that happens to guard on every input. When cudagraphs
is on, we force everything static, and this automatically does the right
thing because we will force a recompile if sizes change.
This obsoletes https://github.com/pytorch/pytorch/pull/101813
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103290
Approved by: https://github.com/voznesenskym
We discussed in a composability meeting a few weeks ago that `pre_autograd` should probably be renamed to `pre_dispatch`.
One question in this PR was: should I re-use a dispatch key? Or should I create a new dispatch key (that yet again corresponds to "top of the dispatcher")?
~~For now, I ended up sticking our proxy mode on the mode stack corresponding to `PythonTLSSnapshot`, because it was simple and it works. It looks like one of the functorch dispatch keys has higher priority though, so it's possible that functorch will end up running first. Open to options, but we can consider adding a new dispatch key later if that becomes a problem~~
Update: I added a dedicated dispatch key, `PreDispatch`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101818
Approved by: https://github.com/ezyang, https://github.com/Neilblaze, https://github.com/albanD, https://github.com/zou3519
Now, when you do an inplace mutation and the view is naughty, you get this message:
```
RuntimeError: A view was created in no_grad mode and is being modified inplace with grad mode enabled. Given that this use case is ambiguous and error-prone, it is forbidden. You can clarify your code by moving both the view and the inplace either both inside the no_grad block (if you don't want the inplace to be tracked) or both outside (if you want the inplace to be tracked). To find out where this view was allocated, run your entire forward region under anomaly mode (torch.autograd.detect_anomaly(check_nan=False)).
```
When you run under anomaly mode, you get:
```
RuntimeError: A view was created in no_grad mode and is being modified inplace with grad mode enabled. Given that this use case is ambiguous and error-prone, it is forbidden. You can clarify your code by moving both the view and the inplace either both inside the no_grad block (if you don't want the inplace to be tracked) or both outside (if you want the inplace to be tracked). This view was allocated at:
File "/data/users/ezyang/c/pytorch/test/test_autograd.py", line 4299, in arglebargle
File "/data/users/ezyang/c/pytorch/test/test_autograd.py", line 4306, in test_anomaly_gives_view_stack
File "/home/ezyang/local/c/pytorch-env/lib/python3.10/unittest/case.py", line 549, in _callTestMethod
File "/home/ezyang/local/c/pytorch-env/lib/python3.10/unittest/case.py", line 591, in run
File "/data/users/ezyang/c/pytorch/torch/testing/_internal/common_utils.py", line 2266, in _run_with_retry
File "/data/users/ezyang/c/pytorch/torch/testing/_internal/common_utils.py", line 2337, in run
File "/home/ezyang/local/c/pytorch-env/lib/python3.10/unittest/case.py", line 650, in __call__
File "/home/ezyang/local/c/pytorch-env/lib/python3.10/unittest/suite.py", line 122, in run
File "/home/ezyang/local/c/pytorch-env/lib/python3.10/unittest/suite.py", line 84, in __call__
File "/home/ezyang/local/c/pytorch-env/lib/python3.10/unittest/suite.py", line 122, in run
File "/home/ezyang/local/c/pytorch-env/lib/python3.10/unittest/suite.py", line 84, in __call__
File "/home/ezyang/local/c/pytorch-env/lib/python3.10/unittest/runner.py", line 184, in run
File "/home/ezyang/local/c/pytorch-env/lib/python3.10/unittest/main.py", line 271, in runTests
File "/home/ezyang/local/c/pytorch-env/lib/python3.10/unittest/main.py", line 101, in __init__
File "/data/users/ezyang/c/pytorch/torch/testing/_internal/common_utils.py", line 894, in run_tests
File "/data/users/ezyang/c/pytorch/test/test_autograd.py", line 11209, in <module>
```
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103185
Approved by: https://github.com/zdevito
Using [`nanoGPT/model.py`](https://github.com/karpathy/nanoGPT/blob/master/model.py) run
<details><summary><b>Click for script to save gpt2-xlarge (1.5B params)</b></summary>
```
# test_load_save_gpt.py
from model import GPT
import torch
import time
torch.manual_seed(5)
# gpt2-xlarge 1558M parameters
class GPTConfig:
    block_size: int = 1024
    vocab_size: int = 50304  # GPT-2 vocab_size of 50257, padded up to nearest multiple of 64 for efficiency
    n_layer: int = 48
    n_head: int = 25
    n_embd: int = 1600
    dropout: float = 0.0
    bias: bool = True  # True: bias in Linears and LayerNorms, like GPT-2. False: a bit better and faster

def f():
    model = GPT(GPTConfig())
    state_dict = model.state_dict()
    start_saving = time.time()
    torch.save(state_dict, "gpt2-xlarge.pth")
    end_saving = time.time()

if __name__ == "__main__":
    f()
```
</details>
<details><summary><b>Click for script to load</b></summary>
```
# test_load_gpt.py
import torch
from model import GPT
from test_load_save_gpt import GPTConfig
import time
import argparse
def f(mmap, meta):
    device = 'meta' if meta else 'cpu'
    assign = True if meta else False
    with torch.device(device):
        model = GPT(GPTConfig())
    start_loading = time.time()
    loaded_state_dict = torch.load("gpt2-xlarge.pth", _mmap=mmap)
    end_loading = time.time()
    print(f"loading time using torch.load with mmap={mmap}: ", end_loading - start_loading)
    model.load_state_dict(loaded_state_dict, assign=assign)
    end_load_state_dict = time.time()
    print("load_state_dict time: ", end_load_state_dict - end_loading)
    model.cuda()
    end_cuda = time.time()
    print("cuda time using torch.load with mmap: ", end_cuda - end_load_state_dict)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(prog='load_gpt_xlarge')
    parser.add_argument('-m', '--mmap', action='store_true')
    parser.add_argument('-d', '--devicemeta', action='store_true')
    args = parser.parse_args()
    mmap = args.mmap
    meta = args.devicemeta
    f(mmap, meta)
```
</details>
`python test_load_gpt.py`
<img width="614" alt="Screenshot 2023-06-06 at 1 35 43 PM" src="https://github.com/pytorch/pytorch/assets/35276741/ee06e5b3-b610-463b-a867-df995d21af29">
`python test_load_gpt.py --mmap`
<img width="622" alt="Screenshot 2023-06-06 at 1 35 30 PM" src="https://github.com/pytorch/pytorch/assets/35276741/00d2fdd0-b1f5-4313-83dc-e540b654b2af">
If we further use the `with torch.device('meta')` context manager and pull the changes from https://github.com/pytorch/pytorch/pull/102212 that allow the model to reuse tensors from the state_dict, we have
`python test_load_gpt.py --mmap --devicemeta`
<img width="727" alt="Screenshot 2023-06-06 at 1 35 51 PM" src="https://github.com/pytorch/pytorch/assets/35276741/b50257d9-092a-49c3-acae-876ee44d009f">
Running the above in a docker container containing a build of PyTorch with RAM limited to 512mb by
1) running `make -f docker.Makefile` from `pytorch/` directory
2) `docker run -m 512m -it <image> bash`
3) docker cp `gpt2-xlarge.pth` and `test_load_gpt.py` into the image
`python test_load_gpt.py`
Docker will Kill the process due to OOM whereas
`python test_load_gpt.py --mmap --devicemeta`
<img width="635" alt="Screenshot 2023-06-06 at 1 55 48 PM" src="https://github.com/pytorch/pytorch/assets/35276741/f3820d9e-f24c-43e7-885b-3bfdf24ef8ad">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102549
Approved by: https://github.com/albanD
When checking Meta's internal cmf10x model, I found this interesting kernel https://gist.github.com/shunting314/d4b1fc7352c840ef185c607392e21f31 . Doing coordinate descent tuning starting from the out-of-the-box config finds a sub-optimal config: a config worse than the best one the max-autotuner can find.
This indicates that the coordinate descent tuner does not necessarily find the optimal config; the starting point matters.
I want to make the coordinate descent tuning less dependent on the starting point. I also think that, by improving this, the coordinate descent tuner may be more likely to find even better configs when starting from the max-autotune result.
There are 2 ideas.
1. Currently, coordinate descent tuning only considers changing one field/coordinate at a time. I add the ability to check all directions (i.e. tuning all tunable fields at the same time) once the normal coordinate descent search no longer finds better choices. I'll check how that works on cmf10x.
2. Currently, when we change a field, we only change it by 1 step (i.e. the radius is 1). I add the ability to use a larger radius. This only affects the search in all directions and does not affect the normal coordinate descent search workflow.
Both are disabled by default; a sketch of the search follows below.
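A minimal sketch of the two ideas (pure illustration: names are hypothetical, and the real tuner steps fields such as XBLOCK/RBLOCK/num_warps by powers of two and validates configs):
```python
import itertools

def coordesc(config, benchmark, radius=1, check_all_directions=False):
    # config: dict of tunable int fields; benchmark: returns runtime of a config (lower is better)
    best, best_t = dict(config), benchmark(config)
    improved = True
    while improved:
        improved = False
        # classic coordinate descent: change one field at a time, one step
        for name in best:
            for step in (-1, 1):
                cand = dict(best)
                cand[name] += step
                t = benchmark(cand)
                if t < best_t:
                    best, best_t, improved = cand, t, True
        # idea 1 + 2: once stuck, try all fields at once, up to `radius` steps each
        if not improved and check_all_directions:
            names = list(best)
            for deltas in itertools.product(range(-radius, radius + 1), repeat=len(names)):
                cand = {n: best[n] + d for n, d in zip(names, deltas)}
                t = benchmark(cand)
                if t < best_t:
                    best, best_t, improved = cand, t, True
    return best
```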
Here are the tests I've done:
- OOB (out of the box): 0.083ms 0.003GB 38.13GB/s
- MA (max autotune): 0.016ms 0.003GB 195.60GB/s
- best config: XBLOCK: 4, RBLOCK: 128, num_warps: 4, num_stages: 1
Default coordinate descent:
- Coordesc (coordinate descent tuner) upon OOB: 0.024ms 0.003GB 131.52GB/s ( **WORSE than Max Autotune** )
- best config: XBLOCK: 64, RBLOCK: 4, num_warps: 16, num_stages: 1
- Coordesc upon MA: 0.016ms 0.003GB 194.31GB/s (no further improvement upon MA)
Search in all directions: (radius = 1)
- Coordesc upon OOB: 0.017ms 0.003GB 184.55GB/s
- best config: XBLOCK: 32, RBLOCK: 16, num_warps: 32, num_stages: 1
- **IMPROVE FROM 0.024ms to 0.017ms. QUITE CLOSE TO THE ONE FIND BY MAX-AUTOTUNE**
- Coordesc upon MA: no further improvements upon MA
Search in all directions: (radius = 2)
- Coordesc upon OOB: 0.016ms 0.003GB 192.60GB/s
- best config: XBLOCK: 8, RBLOCK: 16, num_warps: 8, num_stages: 1
- **SLIGHTLY BETTER THAN RADIUS=1 for this kernel and on par with max-autotune**
- Coordesc upon MA: no further improvements upon MA
**Overall max-autotuner does a really good job for this kernel**
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99403
Approved by: https://github.com/jansel
Fixes #ISSUE_NUMBER
1. The class named "Type" is not used anywhere anymore, so I added a warning message saying it will be removed in the future.
2. Added an arg (default is "cuda") to save_on_cpu so that it can support more device types (like privateuse1).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103245
Approved by: https://github.com/soulitzer
This PR gets rid of the dim_groups attribute from DeviceMesh. The main
motivation behind this is that we should let c10d store the process
groups during its creation instead of DeviceMesh; DeviceMesh should just
handle ranks correctly.
This could enable DTensor to become picklable! (torch.save/load could be
possible), which I will give a try in the next PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103105
Approved by: https://github.com/XilunWu, https://github.com/fduwjj
Other projects have seen a similar issue https://github.com/quantumlib/Cirq/issues/4637
## Before
```
(nightly) ubuntu@ip-172-31-2-131:~$ python /tmp/torchinductor_ubuntu/eq/ceqs7t4pesfhqllk6qf4k5spu2cm23l7quqdt2mkrp4rlcjl6kw5.py
Traceback (most recent call last):
File "/tmp/torchinductor_ubuntu/eq/ceqs7t4pesfhqllk6qf4k5spu2cm23l7quqdt2mkrp4rlcjl6kw5.py", line 47, in <module>
module = CppWrapperCodeCache.load(cpp_wrapper_src, 'inductor_entry_cpp', 'czenwgemzbe2etzbh7hzhnwjhyamvwirgodyjlly75fayy4tp3rx', False)
File "/opt/conda/envs/nightly/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 846, in load
assert isinstance(spec.loader, importlib.abc.Loader)
AttributeError: module 'importlib' has no attribute 'abc'. Did you mean: '_abc'?
```
## After
```sh
(nightly) ubuntu@ip-172-31-2-131:~/test$ python /tmp/torchinductor_ubuntu/eq/ceqs7t4pesfhqllk6qf4k5spu2cm23l7quqdt2mkrp4rlcjl6kw5.py
0.000272
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103277
Approved by: https://github.com/desertfire
Successful test run found at https://github.com/pytorch/pytorch/actions/runs/5213244046/jobs/9410138550
### <samp>🤖 Generated by Copilot at 8d7d860</samp>
This pull request adds a new feature to create and upload alerts for failing jobs in the pytorch/pytorch repo. It introduces a new script `tools/alerts/create_alerts.py` to generate alert entries and a new workflow `.github/workflows/upload-alerts.yml` to run the script and upload the alerts periodically.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102995
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
### <samp>🤖 Generated by Copilot at 971a80c</samp>
This pull request adds support for building docker images that can run performance benchmarks using the inductor framework. It introduces new files and scripts to install the benchmark dependencies, and updates the docker build and test workflows to use the new images. It also fixes some minor issues with the existing inductor tests and workflows.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102881
Approved by: https://github.com/huydhn
Summary:
Previously, the QAT pattern for conv + bn + relu was
not actually replaced in convert. This is because the quantized
QAT pattern used in convert doesn't actually have a relu node.
This commit adds this extra pattern in the convert path and
the numerics now match FX's.
Test Plan: python test/test_quantization.py TestQuantizePT2E.test_qat_conv_bn_relu_numerics
Reviewed By: jerryzh168
Differential Revision: D46372411
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102993
Approved by: https://github.com/jerryzh168
Manually generate guards for the optimizer rather than using the variable builder, which can be slow with lots of params.
This is the reason for the ~10s compile slowdown.
Re-disable `_init_group`. This is important because, if for any reason a frame which calls `_init_group` is run in the Python interpreter, we will trace it, which we don't want to do. We only want to call it when it is accessed via the fast path implemented with the optimizer variable during symbolic interpretation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103121
Approved by: https://github.com/jansel
This PR makes a first attempt at improving FSDP's fine-tuning support by adding hooks to reshard frozen parameters in the backward pass.
- Without this, frozen parameters involved in gradient computation are kept as unsharded through the entire backward pass.
- The approach is to register a multi-grad ~~post~~-hook on the _input_ activations to the FSDP module, where the hook performs the resharding after all gradients for the FSDP module must have been computed (meaning that we are safe to reshard).
~~This PR relies on adding a "multi-grad post-hook" that differs from the existing "multi-grad hook" from `register_multi_grad_hook()`. I find that with `register_multi_grad_hook()`, sometimes the unit test counting the number of times `_post_backward_reshard()` is called fails (due to it not being called).~~ This was resolved in https://github.com/pytorch/pytorch/pull/102859.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101982
Approved by: https://github.com/rohan-varma
When torch.cat gets called on a list of contiguous tensors that are aligned on a 16B boundary in memory, the number of thread blocks used is directly proportional to the maximum size of the tensors in the list. If one or more tensors are very large while the others are small, the high number of thread blocks results in useless redundant loads of the input metadata. This PR limits the grid size and improves the performance of cat when used on lists of tensors with large variations in size.
Used the same test program as https://github.com/pytorch/pytorch/pull/102815 but added new cases with lists of tensors of varying sizes.
<img width="735" alt="Screenshot 2023-06-07 at 10 14 18 PM" src="https://github.com/pytorch/pytorch/assets/23515689/72d0e5cb-5840-400e-b53b-d1418e664f19">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103233
Approved by: https://github.com/malfet
Moved SlicedBufferedReader to utils and renamed it to _ReaderView.
It no longer depends on file handles and is a pure wrapper. This makes it general enough to handle non-IO stream objects like fsspec's.
Should help with #98386
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99167
Approved by: https://github.com/wz337
Successful test run found at https://github.com/pytorch/pytorch/actions/runs/5179855118/jobs/9333292038 (uses equivalent PRs)
### <samp>🤖 Generated by Copilot at 8d7d860</samp>
This pull request adds a new feature to create and upload alerts for failing jobs in the pytorch/pytorch repo. It introduces a new script `tools/alerts/create_alerts.py` to generate alert entries and a new workflow `.github/workflows/upload-alerts.yml` to run the script and upload the alerts periodically.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102995
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
Fixes #92576, checking the following as described in the documentation:
"source.shape[dim] == len(index) and source.shape[i] == self.shape[i] for i != dim"
Would be happy to iterate on this if there are any issues, and would be happy to implement the checking for the CUDA and MPS implementations of index_add_.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100321
Approved by: https://github.com/lezcano
Summary:
Prepare QAT for resnet18 has matching numerics with FX.
Adding this test requires us to refactor the way the test code
is structured, however.
Test Plan: python test/test_quantization.py TestQuantizePT2EModels.test_qat_resnet18
Differential Revision: D46456243
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103020
Approved by: https://github.com/kimishpatel
Even if you passed in --amp we would run inference in float32.
`AlbertForMaskedLM` goes from 1.305 float32 to 1.724x amp, and then again to 1.910x with freezing. Benchmark numbers for amp are about to go way up lol.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103220
Approved by: https://github.com/desertfire
Summary: DataLoader.cpp signal handlers are adding some special behavior (e.g. exit(0) on SIGTERM under certain conditions). To preserve this behavior we should install additional signal handlers on top of default ones, rather than completely replacing them.
Test Plan: unit tests
Reviewed By: drej82
Differential Revision: D46525348
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103164
Approved by: https://github.com/drej82
Previously, defining a HigherOrderOperator (like cond) automatically generated
a torch.ops.cond and caused it to trace into the FX graph as e.g.
torch.ops.cond.
This is not good, because:
- Duplication. Since HigherOrderOperators are written in Python, they have an
associated Python function that users should access them from. E.g.
torch.cond (when we make it public). That is what should actually appear in the
graph.
- torch.ops.cond is a valid namespace for operator registration; having
it be a function too confuses things.
This PR:
- Moves cond/map HigherOrderOperators to be under torch (necessary for
the FX logic to not do weird things)
- Sets the `__module__` of a HigherOrderOperator correct. This is what
FX uses when tracing the operator.
Test Plan:
- updated tests
Future:
- I'll delete the ability to call cond as torch.ops.cond in a couple of
days, after this change circulates internally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103108
Approved by: https://github.com/ydwu4
Summary:
This API is used by gen_executorch.py to check whether a kernel with a specified kernel key is used or not.
Test Plan:
```
buck test xplat/caffe2/tools:test_torchgen_executorch
buck run fbcode//executorch/codegen/tools:test_gen_oplist_real_model
```
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103184
Approved by: https://github.com/larryliu0820
Allow DTensor to support cuda-like devices, fixes https://github.com/pytorch/pytorch/issues/102442
Currently, DTensor supports cuda and cpu. There are other efforts to make DTensor support third-party devices, for example https://github.com/pytorch/pytorch/pull/101914 and https://github.com/pytorch/pytorch/issues/101911. However, those only cover a portion of third-party devices and do not properly support third-party cuda-like devices. Therefore, we would like to extend DTensor to support cuda-like devices; after all, cuda is so popular!
1. Similar to what is done here, we need to initialize the communication backend for the device set by DeviceMesh. So `_default_backend_for_device` is added to `Backend`. It is worth noting that when we register a new backend for a device other than cpu and cuda, we also need to add a new default backend for this device.
2. Adding `_device_handle` to `DeviceMesh` for cuda-like devices, similar to what is set in FSDP. When `_device_handle` is not None, the device has similar behavior to `cuda`. In this way, functions like `torch.cuda.device_count()` need to be modified to `device_mesh._device_handle.device_count()`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102468
Approved by: https://github.com/wanchaol
- Don't copy inputs in cudagraphs wrapping, since the copies would distort timing and triton's do_bench will clear the cache anyway
- Don't skip op if there is a fallback, since we have both fallbacks and lowerings for some ops
- Add option for channels last
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103110
Approved by: https://github.com/desertfire
Workaround for https://github.com/pytorch/pytorch/issues/102886
related to: https://github.com/pytorch/pytorch/issues/102476, https://github.com/pytorch/pytorch/issues/102475, https://github.com/pytorch/pytorch/issues/102474, https://github.com/pytorch/pytorch/issues/102473, https://github.com/pytorch/pytorch/issues/102472
Since 9aaa12e328, the first inductor (CPU) UT fails until the GPU context is correctly initialised, and the subsequent UTs pass. CUDA observed the same issue and a workaround was pushed to force initialisation of the cuda context by declaring an empty tensor (https://github.com/pytorch/pytorch/issues/92627). We have adopted the same approach but opted for `torch.zeros`, which correctly activates the HIP context after the kernel launch.
**Reproducer:**
```
import torch
from torch._subclasses.fake_tensor import FakeTensorMode
import argparse
if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Swap between torch.empty and torch.randn operations.')
    parser.add_argument('--empty', action='store_true', help='Use torch.empty operation')
    parser.add_argument('--rand', action='store_true', help='Use torch.randn operation')
    args = parser.parse_args()
    torch.cuda.set_device(0)
    if args.empty:
        torch.empty(1, device="cuda")
    elif args.rand:
        torch.rand(1, device="cuda")
    print(f"0: hasPrimaryContext: {torch._C._cuda_hasPrimaryContext(0)}")
    with FakeTensorMode():
        p = torch.randn(4, 2, requires_grad=True, device='cuda')
        x = torch.randn(8, 4, device='cuda')
        y = torch.mm(x, p).square().sum()
        y.backward()
```
**ROCm python repro.py --empty**
0: hasPrimaryContext: False
**ROCm python repro.py --rand**
0: hasPrimaryContext: True
**CUDA python repro.py --empty**
0: hasPrimaryContext: True
**CUDA python repro.py --rand**
0: hasPrimaryContext: True
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103149
Approved by: https://github.com/eellison
Use gcc9 in the linux-bionic-cuda12_1-py3_10-gcc9-build workflows,
after the PR which fixed the gcc9 transition: https://github.com/pytorch/multipy/pull/321
### <samp>🤖 Generated by Copilot at a076506</samp>
This pull request updates the GCC version for Python 3.10 and CUDA 11.8/12.1 test images and removes the unused CUDA 12.1 image configuration and reference from the docker build scripts and workflow.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103075
Approved by: https://github.com/malfet
Summary:
In this diff we test a module that does a) an embedding lookup, b) runs a 1D
(converted to 2D) conv, and c) runs a linear on the output of the 1D conv.
a is quantized using embedding quantizer.
c is quantized using dynamic quantization.
b is quantized using static quantization.
We compose quantizer from [a, c, b]. Tested it against similar fx config.
Test Plan: test_embedding_conv_linear_quantization
Reviewed By: jerryzh168
Differential Revision: D46267688
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103116
Approved by: https://github.com/jerryzh168
Fix https://github.com/pytorch/pytorch/issues/100830.
For the inplace node, a `copy_` will be generated and the `copy_` will be `realized` as a `scheduler buffer` since it is a mutation. This `scheduler buffer` is a memory copy, but after fusing with the previous buffer it will no longer be a memory-copy-only buffer.
This PR solves the issue by removing `load_bf16_as_fp32` and `store_bf16_from_fp32`. Instead, enable fp32/bf16 vec conversion in `to_dtype`. Then we always store bf16.
```python
import torch
import torch.nn as nn

torch.manual_seed(420)
from torch._inductor import config

x = torch.randn(1, 18, dtype=torch.bfloat16)

class ExampleModel(nn.Module):
    def __init__(self):
        super(ExampleModel, self).__init__()
        self.relu = nn.ReLU(inplace=True)  # nn.ReLU(inplace=False)

    def forward(self, input1):
        out = self.relu(input1)
        # input1.copy_(out)
        return out

func = ExampleModel()

with torch.no_grad():
    func.train(False)
    res1 = func(x)  # without jit
    print(res1)
    jit_func = torch.compile(func)
    res2 = jit_func(x)
    print(res2)
```
Generated code without this PR: (the `tmp3` store is wrong, `tmp3` is `float` while `out_ptr1` is `bf16`)
```
auto tmp0 = load_bf16_as_float(out_ptr1 + static_cast<long>(i0));
auto tmp1 = (tmp0);
auto tmp2 = at::vec::clamp_min(tmp1, decltype(tmp1)(0));
auto tmp3 = (tmp2);
store_float_as_bf16(out_ptr0 + static_cast<long>(i0), tmp3);
tmp3.store(out_ptr1 + static_cast<long>(i0), 16);
```
Generated code with this PR:
```
auto tmp0 = at::vec::Vectorized<bfloat16>::loadu(out_ptr1 + static_cast<long>(i0), 16);
auto tmp1 = cvt_bf16_to_fp32(tmp0);
auto tmp2 = at::vec::clamp_min(tmp1, decltype(tmp1)(0));
auto tmp3 = cvt_fp32_to_bf16(tmp2);
tmp3.store(out_ptr0 + static_cast<long>(i0), 16);
tmp3.store(out_ptr1 + static_cast<long>(i0), 16);
```
This PR also fixed the data type propagation for `masked_subblock`.
Before, the masked_subblock's dtype was propagated from its input, which is wrong.
```
opcode name target args kwargs
----------- --------- --------- -------------------------- --------
call_module masked_subblock1 masked_subblock1 (and__2, -inf)
```
Now we propagate it from the subblock with the same name:
```
# graph for body.subblocks['masked_subblock1']
opcode name target args kwargs
----------- --------- --------- -------------------------- --------
placeholder ops ops () {}
call_module get_index get_index ('index2',) {}
call_method load load (ops, 'arg0_1', get_index) {}
call_method to_dtype to_dtype (ops, load, torch.float32) {}
output output output (to_dtype,) {}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101042
Approved by: https://github.com/jgong5, https://github.com/jansel
Both internal and OSS users trying https://github.com/pytorch/pytorch/pull/99937 report that their workloads perform normally even with the barrier removed and see a scalability win. Thus in this PR, we decide to make it default that PG do not perform a barrier after init.
In the discussion of #99937, people point out that such a barrier might be needed for c10d + RPC cases. IMO, this need originates from RPC's programming model and should be RPC's or the RPC user's responsibility to deal with. That is, it can happen with other functions/libraries too. So the need for c10d to do so big a favor is not justified, IMO. It is also good to remove it before users become reliant on this barrier.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103033
Approved by: https://github.com/XilunWu
This PR creates a device_mesh and share it across all FSDP state. The device_mesh will later be used to test out dtensor state_dict (1d device_mesh).
Approved by: https://github.com/awgu
Add device mesh to fsdp state
skip dist.get_world_size(pg) != dist.get_world_size()
address test_fake_pg.py test failure
fix test_fake_py.py failure
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102551
Approved by: https://github.com/fegin
- Disables dynamo on the differentiable optimizer tests
- Disables dynamo on some test methods which expose a very rare dynamo edge case
- Disables dynamo on export/save optimizer state methods because it shouldn't trace those anyway.
I have a draft PR to fix the two tests marked skip due to unsupported mutation of step.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103066
Approved by: https://github.com/janeyx99, https://github.com/malfet
# Summary
Since we have upstreamed the latest changes of memory-efficient attention, we can remove the sm86/sm89-specific check. All head_sizes (assuming correct alignment) should work for sm86 and sm89; there is no maximum head size cap.
If head_size > 96 there will be a big drop in performance, but it should not error and still maintains memory savings by not materializing attention weights.
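For context, a rough sketch of the case this unblocks on sm86/sm89 (shapes are illustrative; the backend-selection context manager shown is the one available at this time):
```python
import torch
import torch.nn.functional as F

# head_size 128 > 96: previously rejected on sm86/sm89, now expected to run
# (slower, but without materializing the attention weights)
q, k, v = (torch.randn(2, 8, 1024, 128, device="cuda", dtype=torch.float16)
           for _ in range(3))
with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_math=False,
                                    enable_mem_efficient=True):
    out = F.scaled_dot_product_attention(q, k, v)
```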
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102985
Approved by: https://github.com/cpuhrsch
Before the PR, when running super(MyConv1d, self).forward or super(MyConvTranspose, self).forward, dynamo would create a graph break when executing NNModuleVariable.call_method and raise an unimplemented error for name=_conv_forward / _output_padding. See the issue for full details: https://github.com/pytorch/pytorch/issues/101155
After the PR, for torch.nn conv modules with function name _conv_forward / _output_padding, we inline the function with tx.inline_user_function_return.
Code refactor: added NNModuleVariable._inline_user_function_return_helper to consolidate tx.inline_user_function_return into one place and keep the code DRY. After the refactor, there are still 2 unconsolidated inline_user_function_return call sites with different `fn` and `source` logic, but the code is still DRY. For local testing, they are covered by test_modulelist, test_moduledict, test_conv_call_super_forward_directly and test_conv_transpose_call_super_forward_directly in test_modules.py.
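A minimal repro of the pattern in question (adapted from the linked issue; the concrete module below is just an example):
```python
import torch
import torch.nn as nn

class MyConv1d(nn.Conv1d):
    def forward(self, x):
        # before this PR: graph break / unimplemented on _conv_forward
        return super().forward(x)

m = torch.compile(MyConv1d(3, 8, kernel_size=3))
out = m(torch.randn(1, 3, 16))
```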
Differential Revision: [D46494460](https://our.internmc.facebook.com/intern/diff/D46494460)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102509
Approved by: https://github.com/yanboliang
ROCm queries for the number of processes it should use per machine, which might cause it to be different across shards, which leads to inconsistencies when distributing tests among shards.
My solution is to separate the vars used for shard calculations and the actual number of procs that can be used and to ensure that the var used for shard calculations is consistent across all shards for a test config + job. I believe that the only consequence is that rocm sharding might become unbalanced.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102871
Approved by: https://github.com/huydhn, https://github.com/malfet
Minor QOL change. This log message is pushed into my history by the
backtrace, which is a pain because if I tab up in tmux I can no longer
paste it without line breaks. This makes it more convenient to use tmux
copy mode to get only the file (as I get the entire line this way.)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103083
Approved by: https://github.com/albanD
Summary: bias_addmm is not backed by a cpp function, so turn off
autotune_cublasLt for cpp_wrapper + max_autotune. We can add a cpp
function implementation if there is a performance need.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103004
Approved by: https://github.com/jansel
Summary:
Using the composable quantizer, we can now compose two or more quantizers. In
the test here we compose a quantizer configured for dynamic linear quantization
with a quantizer configured for static quantization.
Note that the composable quantizer has a strict order in which annotations are
applied.
Test Plan: test_composable_quantizer*
Reviewed By: jerryzh168
Differential Revision: D46267690
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102846
Approved by: https://github.com/andrewor14
This is done in the ordinary way, but also:
* Deprecation warning for the old API, and a migration guide
* Backwards compatibility for state_dict loading the old weight_norm
* Test for pickling and deepcopy, which was the motivating reason
weight_norm is still used by HuggingFace Wav2Vec2.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103001
Approved by: https://github.com/albanD
Skip all cuda graph-related unit tests by setting env var `PYTORCH_TEST_SKIP_CUDAGRAPH=1`
This PR refactors the `TEST_CUDA` python variable in test_cuda.py into common_utils.py. This PR also creates a new python variable `TEST_CUDA_GRAPH` in common_utils.py, which has an env var switch to turn off all cuda graph-related tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103032
Approved by: https://github.com/malfet
v2 of https://github.com/pytorch/pytorch/pull/102125 because of git issues
corresponding deserialization diff: https://github.com/pytorch/pytorch/pull/102716
Implementing serialization of the exported program to a python dataclass, and then from that dataclass to json. This is split into a couple of sections:
- `serialize(ep: ep.ExportedProgram, opset_version: Dict[str, int]) -> Tuple[bytes, bytes]` -- takes an exported program object, a dictionary mapping opset namespaces to versions, and returns the serialized exported program in bytes, and separately the state dict serialized in bytes
- `GraphModuleSerializer` class that serializes torch.fx.GraphModule
to the schema.GraphModule dataclass
- `ExportedProgramSerializer` class that serializes torch._export.exported_program.ExportedProgram to the schema.ExportedProgram dataclass
Serialization TODOs:
- [x] pytree spec: https://github.com/pytorch/pytorch/pull/102577
- [ ] higher order ops
- [ ] node metadata (specifically nn_module_stack/source_fn)
- [ ] constraints
- [ ] graph module metadata
The tests are not super comprehensive, but that's because I think it'll be better tested + easier to test once deserialization is implemented.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102707
Approved by: https://github.com/avikchaudhuri, https://github.com/zhxchen17
Fixes #92675
Here we implement a native version of [`einops.rearrange`](https://einops.rocks/api/rearrange/) using first class dims to perform the operations. The string parsing + validation, documentation, and relevant tests are adapted from `einops`. The API is exactly the same as the `einops` API.
The main idea is to take the string and convert it to a left and right `ParsedExpression`, and then find a mapping from the axes to first class dims. Once the mapping exists we convert the left expression `composition` list into a `Tensor.__getitem__` index and the right expression `composition` into the `Tensor.order` arguments, and then use this to dynamically create a callable that performs the `rearrange` operation as specified by the pattern.
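For reference, a small usage sketch of the einops-style API (the call below is written against the einops package itself, since the native version's import path is not shown here; the signature and semantics are the same):
```python
import torch
from einops import rearrange  # the native version exposes the same signature

x = torch.randn(2, 3, 4, 5)               # pattern axes: 'b c h w'
y = rearrange(x, 'b c h w -> b (h w) c')  # flatten spatial dims, channels last
# equivalent plain-torch formulation, for comparison
assert torch.equal(y, x.permute(0, 2, 3, 1).reshape(2, 4 * 5, 3))
```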
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101957
Approved by: https://github.com/zdevito
Sometimes you'll see linter failures on CI that don't repro locally, caused by the local linter not having installed the latest config.
These instructions explain how to make both the CI and local linter consistent again
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102990
Approved by: https://github.com/huydhn
This is done by introducing two new base classes: InPlaceCollectiveKernel and OutOfPlaceCollectiveKernel.
They deal with the differences for when InPlaceHint needs to be used.
In addition, we introduce a `has_side_effects` method on buffers that
prevents them from being DCE'd by the scheduler. This is needed because InPlaceHint
nodes both wrap the inputs and are the outputs, which leaves no users for the collectives
themselves.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99765
Approved by: https://github.com/wconstab
Summary:
Previously, the test for the convert flow in Conv + BN
QAT fusion was not enabled by mistake. However, reenabling this
test uncovered several bugs:
(1) The replaced nodes returned by subgraph rewriter were not
handled correctly. This is because a recent change in the subgraph
rewriter (#100556) fixed only the prepare case but not the convert
case. This commit brings this fix to the convert case as well and
deduplicates some code between the two cases.
(2) When folding BN into conv, we used the wrong arg index to get
the BN eps value. This resulted in an incorrect conv weight.
(3) In FX, we currently do a hack for weighted modules where we
observe the weights once in convert in order to ensure we get the
right shapes for these weight observers. This caused the numerics
to diverge between PT2 and FX. This commit fixes this by skipping
this unnecessary hack for `_convert_to_reference_decomposed_fx`.
(4) Per channel support was simply missing. This commit adds
support for this by matching the quantize_per_channel and
dequantize_per_channel ops in addition to the existing ones.
Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_qat_conv_bn_numerics
Reviewed By: jerryzh168
Differential Revision: D46097783
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102224
Approved by: https://github.com/jerryzh168
Summary: trigger tracing for MTIA events on python side when ProfilerActivity.MTIA is specified
Test Plan:
Test diff: D45437426
```
hg graft D45437426
```
- in one terminal
```
cd ~/fbsource/fbcode
buck2 run -j 8 \
//infra_asic_fpga/firmware/tools/mad/service:mad_service
```
- in another terminal
Pytorch profiler
```
buck run mode/dev-nosan -j 8 //caffe2/torch/fb/acc_runtime/afg/tests:test_afg -- -m kernel_add
```
Differential Revision: D46122853
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102288
Approved by: https://github.com/aaronenyeshi
This helps with kernels that make use of caching like mid-range softmax
which reads the data three times.
Selecting `eviction_policy=evict_first` in the last loop of the softmax
operation seems to give a 7-10% speed-up vs. selecting `evict_last` which
was the previous option. I'll put up some benchmarks soon™.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91316
Approved by: https://github.com/ngimel, https://github.com/jansel
Fixes #92240; this adds all variables in `torch/jit/__init__.py` that also have a docs page to `__all__`: https://pytorch.org/docs/stable/jit.html
As stated in the tracking issue, this solves pyright errors like this:
```python
import torch

def foo(x, y):
    return 2 * x + y

traced_foo = torch.jit.trace(foo, (torch.rand(3), torch.rand(3)))  # error: "trace" is not exported from module "torch.jit" (reportPrivateImportUsage)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101678
Approved by: https://github.com/albanD
When running the dynamic-shape path of `OPTForCausalLM`, there is an error: `TypeError: unsupported operand type(s) for +: 'Node' and 'int'`. This PR does the following:
1. For `pointless_cumsum_replacement`, the sizes may be a Node, so we should trace the target pattern using example inputs.
2. For dynamic shapes, we should trace a pattern under fake mode, in which inputs may have symbolic shapes (see the sketch below).
After this PR, the dynamic-shape run of `OPTForCausalLM` works (`python -m torch.backends.xeon.run_cpu --node_id 0 benchmarks/dynamo/huggingface.py --performance --float32 -dcpu --inference -n5 --inductor --dynamic-shapes --only OPTForCausalLM`).
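A minimal sketch of what "tracing a pattern under fake mode" means here (the `pattern` function below is a stand-in, not the actual pattern-matcher code):
```python
import torch
from torch.fx.experimental.proxy_tensor import make_fx

def pattern(x):
    return torch.cumsum(x, dim=-1)

# tracing_mode="symbolic" traces against fake tensors with symbolic sizes,
# so sizes show up as nodes rather than being baked in as ints
gm = make_fx(pattern, tracing_mode="symbolic")(torch.empty(2, 8))
print(gm.graph)
```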
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102820
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
Summary: Add a flag to enforce the gather data dtype. To preserve backward compatibility, the default is False.
Test Plan: local and mast
Reviewed By: zyan0, strisunshinewentingwang
Differential Revision: D46295190
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102802
Approved by: https://github.com/mrshenli
Per title, after https://github.com/pytorch/pytorch/pull/102426 landed, it makes sense to have a new category for UNSTABLE jobs and handle them accordingly in trymerge.
* The simple approach is to check for `unstable` in the check (job) name. I plan to roll this out first and then see if we need to cover the more complicated, but less popular case, of unstable build job. Specifically, an unstable build job has no `unstable` in its name
* An unstable job is ignored by trymerge. This is the same behavior we have atm when a job is moved to unstable. It's completely ignored
* The update to Dr. CI will come later, so that unstable failures would also be hidden like broken trunk or flaky
### Testing
Leverage the broken trunk Windows CPU job atm and mark Windows CPU jobs as unstable https://github.com/pytorch/pytorch/issues/102297
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102784
Approved by: https://github.com/clee2000
Summary:
Replace _dynamo.config with an object instead of module
Current usage patterns of setting and reading fields on config will work
unchanged.
Only changes needed going forward:
1. import torch._dynamo.config will not work. However, just doing
   import torch._dynamo is sufficient to access the dynamo config
   as torch._dynamo.config.
2. Files inside the _dynamo folder need to access config via
   from torch._dynamo.config_util import config instead of
   from torch._dynamo import config, because _dynamo/__init__.py
   imports some of those files, which would create a circular import.
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96455
Approved by: https://github.com/jansel
Summary:
keys and change codegen to take ETKernelIndex
We are adding support for dtype and dim order specialized kernel registration. This requires us to reorganize `BackendIndex` (which is a `Dict[DispatchKey, Dict[OperatorName, BackendMetadata]]`) to be `Dict[OperatorName, Dict[ETKernelKey, BackendMetadata]]`. This PR adds new data structures in order to support this change:
* `ETKernelKey` to retrieve a certain kernel from the registry.
* `ETKernelIndex`, the dictionary from operator name to kernel key to kernel mapping.
Note that the codegen logic is not changed yet, we need subsequent diffs to actually generate code for different kernel keys.
Test Plan: Added tests
Reviewed By: Jack-Khuu
Differential Revision: D46407096
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102874
Approved by: https://github.com/Jack-Khuu, https://github.com/kirklandsign
On calls to `_init_group`, rather than tracing through it, we extract Python values from the arguments and call the initialization directly. This avoids having to trace this function, which is very slow with large parameters, and also avoids graph breaking on it. This is sound in this case because the state is only initialized once in the eager case. Guards on the state and params are generated explicitly rather than via tracing the initialization.
Caveats:
`_init_group` also gathers various state tensors into lists via mutating list arguments to pass to the functional optimizer implementation. These state tensors exist on the optimizer itself, but we don't know exactly how the gathering is done and which tensors correspond to which attributes of the optimizer module (each optimizer has different states). To rectify this, we keep weak_ptrs to all of the tensors collected in the lists in globals (similar to how parameter keys are stored for dictionaries). These pointers are guaranteed to be alive as long as the optimizer object is alive if the internal state is not interfered with and they are guarded with weakref guards
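Conceptually, the weak-pointer bookkeeping looks roughly like the sketch below (names are hypothetical; this is not dynamo's actual guard machinery):
```python
import weakref

_state_tensor_refs = []  # module-level, analogous to stashing weakrefs in globals

def remember_state_tensors(tensors):
    # keep only weak pointers so we don't extend the tensors' lifetimes;
    # they stay alive as long as the optimizer object itself does
    _state_tensor_refs.extend(weakref.ref(t) for t in tensors)

def state_tensors_alive():
    # what a weakref guard conceptually checks before reusing the compiled code
    return all(ref() is not None for ref in _state_tensor_refs)
```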
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102640
Approved by: https://github.com/jansel
Summary:
As titled: after we support SharedQuantizationSpec we don't need these things anymore. This PR refactors the
uses of _input_output_share_observers to SharedQuantizationSpec.
Test Plan:
```
buck2 test mode/opt caffe2/test:quantization_pt2e -- 'caffe2/test:quantization_pt2e'
```
Reviewed By: andrewor14
Differential Revision: D46301342
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102854
Approved by: https://github.com/andrewor14
The "tolerance" option evaluates the model on the baseline device in eager mode (default: CPU) compared to the test device (e.g., CUDA, XLA, etc.) and compares the output tensors to determine the absolute tolerance value based on the [formula](https://pytorch.org/docs/stable/generated/torch.allclose.html). It then saves the results in a CSV file. This comparison highlights the tolerance/accuracy difference between XLA and GPU/CPU devices and can also be used to evaluate newer accelerators. This feature aims to identify accuracy failures on the test device (e.g., XLA) and facilitate quick bug triaging.
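A minimal sketch (a hypothetical helper, not the harness's actual code) of deriving such an absolute tolerance from the allclose formula `|actual - expected| <= atol + rtol * |expected|`:
```python
import torch

def required_atol(expected: torch.Tensor, actual: torch.Tensor, rtol: float = 1.3e-6) -> float:
    # smallest atol that would make torch.allclose(actual, expected, rtol=rtol, atol=atol) pass
    expected, actual = expected.double().cpu(), actual.double().cpu()
    diff = (actual - expected).abs()
    return float((diff - rtol * expected.abs()).clamp(min=0).max())
```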
This feature enables the following capabilities:
1. Ability to monitor accuracy issues of backends
2. Provide more informative picture on accuracy beyond pass/ fail status
3. Having a dump of accuracy information will help triage models accordingly
The data generated using this feature is in the [spreadsheet](https://docs.google.com/spreadsheets/d/1A8BAzSqfAw0Q5rgzK5Gk__Uy7qhuynh8tedxKnH-t94/edit#gid=0).
The spreadsheet data can be used to compile the below summary table:
| Suite | Max Tolerance | | No. of models with high inaccuracy(>=0.005) | | Mean Tolerance | |
|------------------ |:-------------:|:--------:|:-------------------------------------------:|:--------:|:--------------:|:--------:|
| | xla | inductor | xla | inductor | xla | inductor |
| huggingface | 0.1169 | 0.0032 | 1 | 0 | 0.0022 | 0.0005 |
| timm_models | 0.0373 | 2.8892 | 10 | 8 | 0.0028 | 0.7044 |
| torchbench | 3.013 | 3.0381 | 6 | 2 | 0.0016 | 0.0016 |
| All models | 3.013 | 3.0381 | 17 | 10 | 0.0028 | 0.7044 |
I used PyTorch release/2.0 branch and corresponding [commit_pin](https://github.com/pytorch/pytorch/blob/release/2.0/.github/ci_commit_pins/xla.txt) for XLA to generate the above data.
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102218
Approved by: https://github.com/jansel
Makes the `lintrunner init` command work with python 3.11
The old version of numpy would fail to install on python 3.11, where setup would fail to build wheels with the error `AttributeError: fcompiler. Did you mean: 'compiler'?`
The latest version of numpy installs just fine however, so switching to that.
More details in https://github.com/numpy/numpy/pull/22102
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102889
Approved by: https://github.com/kit1980
Fixes #102315
The root cause: for `UnspecializedNNModuleVariable`, which extends `UserDefinedObjectVariable`, if `__bool__` is missing, we should use `__len__` to infer the truth value (see the example below).
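For reference, this mirrors Python's own truthiness rule:
```python
class NoBool:
    def __len__(self):
        return 0  # no __bool__ defined, so bool() falls back to __len__

assert bool(NoBool()) is False  # dynamo now infers the truth value the same way
```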
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102583
Approved by: https://github.com/jansel
Improves #102622 from ~150s to ~15s.
The way computing recursive predecessors works, if `nodes = node1.recursive_predecessors`, then the recursive predecessors of any `n` in `nodes` are still a subset of `nodes`, so we can shortcut computing the intersection in `node.recursive_predecessors - combined_predecessors`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102770
Approved by: https://github.com/Chillee
Update llvm_codegen module to use opaque pointers feature of llvm.
* Set setOpaquePointers to true for llvm context.
* Pass Type to emit\*Load and emit\*Store functions.
* Create TypedPointer struct to keep track of Value and its Type.
* Introduce OpqTy_ to be used for opaque pointer types.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101396
Approved by: https://github.com/jgong5
Issue: #93684
In previous PRs #95849 and #99560 we redirected `numpy.*` and `<tensor>.numpy()` calls to `torch_np.*` methods and attributes, by creating `NumpyNdarrayVariable` for those calls.
We need to handle `NumpyNdarrayVariable` when a graph break happens.
This PR does 2 things:
1. In `codegen.py` we made sure we can reconstruct the value wrapped by `NumpyNdarrayVariable` as a `torch_np.ndarray` on the stack whenever we recompile the subgraph.
2. In `builder.py` we can wrap the value to be `NumpyNdarrayVariable` and save it as graph input.
-----
Starting from commit 6:
## A new design for supporting numpy in dynamo
In short the core concept doesn't change: we still convert `numpy` API calls to `torch_np` API calls. However, instead of wrapping a `torch_np.ndarray` in `NumpyNdarrayVariable`, the new design wraps a `torch.Tensor`.
The reason for doing this change is because we need to keep `torch.Tensor` everywhere in the captured graph, so that it works well with the backend of dynamo. See discussions in https://github.com/Quansight-Labs/numpy_pytorch_interop/issues/142 for details.
### Flow
This is an example showing how do we think about dynamo working on a simple function:
```python
import numpy as np
import torch

def f(x: torch.Tensor, y: torch.Tensor):
    a, b = x.numpy(), y.numpy()
    c = np.add(a, b)
    return torch.from_numpy(c)
```
```
+------------+ +------------+
torch.Tensor | |numpy.ndarray| |
-------------- .numpy() --------------| |
| | | | +------------------+
+------------+ | numpy.add |numpy.ndarray| |torch.Tensor
+------------+ | --------------| torch.from_numpy --------------
torch.Tensor | |numpy.ndarray| | | |
-------------- .numpy() --------------| | +------------------+
| | | |
+------------+ +------------+
+------------+ +----------------+
torch.Tensor | |torch.Tensor | |
-------------- .detach() --------------| |
| | | | +----------------+ +------------+
+------------+ | |torch_np.ndarray| |torch.Tensor| |torch.Tensor
| torch_np.add -----------------| util.to_tensor -------------| .detach() --------------
+------------+ | | | | | |
torch.Tensor | |torch.Tensor | | +----------------+ +------------+
-------------- .detach() --------------| |
| | | |
+------------+ | +----------------+ |
| wrapper on torch_np.add |
+--------------------------------------------------------+
```
### Approach
`torch_np` APIs can take both `torch_np.ndarray` as well as `torch.Tensor`. What we need to do is to have a wrapper for these APIs to convert the return value back to `torch.Tensor`. This way only the wrapper is showing up in the captured graph, with `torch.Tensor`s as input and `torch.Tensor` as output.
If we have a graph break or we've traced to the end of the program, we need to inspect all the `NumpyNdarrayVariable` in the stack and convert them back to `numpy.ndarray`, to make sure the compiled version is still behaving the same as the eager version.
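A conceptual sketch of such a wrapper (attribute and helper names here are hypothetical; the real helpers live in dynamo's utils):
```python
def wrap_torch_np_fn(torch_np_fn):
    def wrapper(*args, **kwargs):
        out = torch_np_fn(*args, **kwargs)
        # unwrap back to torch.Tensor so only tensors appear in the captured
        # graph; `.tensor` is a placeholder name for the wrapped tensor
        return out.tensor if hasattr(out, "tensor") else out
    return wrapper
```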
### Examples
Here's an example of the graph generated:
```python
def fn(x: np.ndarray, y: np.ndarray):
    a = x.real
    b = y.real
    torch._dynamo.graph_break()
    return np.add(a, 1), np.add(b, 1)
```
Graph generated:
```
[2023-05-16 10:31:48,737] torch._dynamo.output_graph.__graph: [DEBUG] TRACED GRAPH
__compiled_fn_0 <eval_with_key>.0 opcode name target args kwargs
------------- -------------- ---------------------------------------------------------- ---------------------- --------
placeholder l_x_ L_x_ () {}
placeholder l_y_ L_y_ () {}
call_function from_numpy <built-in method from_numpy of type object at 0x12b1fdc80> (l_x_,) {}
call_function from_numpy_1 <built-in method from_numpy of type object at 0x12b1fdc80> (l_y_,) {}
call_function attr_wrapper <function attr_wrapper at 0x12e8693a0> (from_numpy, 'real') {}
call_function attr_wrapper_1 <function attr_wrapper at 0x12e8693a0> (from_numpy_1, 'real') {}
output output output ((),) {}
[2023-05-16 10:31:48,908] torch._dynamo.output_graph.__graph: [DEBUG] TRACED GRAPH
__compiled_fn_2 <eval_with_key>.1 opcode name target args kwargs
------------- ------------- ---------------------------------------------------------- ------------------------------- --------
placeholder l_a_ L_a_ () {}
placeholder l_b_ L_b_ () {}
call_function from_numpy <built-in method from_numpy of type object at 0x12b1fdc80> (l_a_,) {}
call_function from_numpy_1 <built-in method from_numpy of type object at 0x12b1fdc80> (l_b_,) {}
call_function wrapped_add <Wrapped function <original add>> (from_numpy, 1) {}
call_function wrapped_add_1 <Wrapped function <original add>> (from_numpy_1, 1) {}
output output output ((wrapped_add, wrapped_add_1),) {}
```
### Changes
* `codegen.py`: reconstruct `numpy.ndarray` from `NumpyNdarrayVariable` by adding bytecode to call `utils.to_numpy_helper()`.
* `output_graph.py`: removes legacy code that does exactly what `codegen.py` does, but which only handled the return case and not the graph-break case.
* `utils.py`: added helpers to convert `numpy.ndarray` to `torch.Tensor` and vice versa. Also adds a wrapper class that takes in a function; in `__call__` it calls the function and converts its output to `torch.Tensor` (or a list of them).
* `builder.py`: add a method to wrap `numpy.ndarray` graph inputs into `NumpyNdarrayVariable`, by calling `torch.from_numpy` in the proxy.
* `misc.py`: `numpy` API calls go into `NumpyVariable`; we find the function with the same name in the `torch_np` module, then wrap it with the wrapper defined in `utils.py`.
* `tensor.py`, `torch.py`: proxy `tensor.numpy()` to `torch.detach()` but wrap it with `NumpyNdarrayVariable`. Similarly, `torch.from_numpy()` -> `torch.detach()` but wrap it with `TensorVariable`. In `NumpyNdarrayVariable`, do similar `torch_np.ndarray` to `torch.Tensor` wrapping for attributes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100839
Approved by: https://github.com/ezyang
Fixes #102441
Improves type hinting of the `module` attribute, since it can easily be bound in `DataParallel.__init__`:
```python
from torch.nn import DataParallel, Module
class MyModule(Module):
...
my_data_parallel = DataParallel(MyModule(), device_ids=[0, 1, 2])
reveal_type(my_data_parallel) # Type of "my_data_parallel" is "DataParallel[MyModule]"
reveal_type(my_data_parallel.module) # Type of "my_data_parallel.module" is "MyModule"
```
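Roughly, the change parameterizes `DataParallel` over the type of the wrapped module; a simplified sketch (not the actual source) of what that looks like:
```python
from typing import Generic, TypeVar
from torch.nn import Module

T = TypeVar("T", bound=Module)

class DataParallel(Module, Generic[T]):
    module: T  # the wrapped module keeps its concrete type

    def __init__(self, module: T) -> None:
        super().__init__()
        self.module = module
```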
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102455
Approved by: https://github.com/Skylion007
Summary:
We are currently silently skipping all PT2 quantization
tests due to a recent typo. This commit fixes this and also adds
warnings so it'll be easier to debug similar issues in the future.
Test Plan: python test/test_quantization.py
Differential Revision: D46383546
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102819
Approved by: https://github.com/jerryzh168
# torch.compiler public API
## Goal
The goal of this document is to describe the public facing API for torchdynamo and torchinductor.
Today both dynamo and torchinductor live in the `torch/_dynamo` and `torch/_inductor` namespaces, with the only public function,
`torch.compile()`, placed directly in `torch/__init__.py`.
This poses a few problems for users trying to take dependencies on PyTorch 2.0:
1. Unclear BC guarantees
2. No builtin discovery mechanism outside of reading the source code
3. No hard requirements for docstrings or type annotations
Most importantly, it mixes two personas, the PyTorch 2.0 developer and the PyTorch 2.0 customer, so this is an attempt to address that. We draw a lot of inspiration from the `functorch` migration to the `func` namespace.
## Alternate names
We did discuss some other alternative names
1. `torch.compile` -> the problem is this would break BC on the existing `torch.compile` function
2. `torch.dynamo` -> `dynamo` is so far not something we've deliberately hidden from users, but the problem is that figuring out when it's `_dynamo` vs `dynamo` might be confusing
3. `torch.compiler` -> option 1 would be better, but to keep BC this is a good compromise
# The general approach
## Proposal 1
In https://github.com/pytorch/pytorch/blob/main/torch/_dynamo/__init__.py
We have a function called `reset()`; this function is essential if users are trying to `torch.compile()` a model under different settings
```python
# in _dynamo/
def reset():
    do_reset_stuff()
```
Instead we propose
```python
# in compiler/
def reset():
    do_reset_stuff()  # As in copy paste the logic from _dynamo.reset
# in _dynamo/
import warnings
import inspect
def reset():
    function_name = inspect.currentframe().f_code.co_name
    warnings.warn(f"{function_name} is deprecated, use compiler.{function_name} instead", DeprecationWarning)
    return compiler.reset()
```
## Proposal 2
```python
# in compiler/
def reset():
    """
    Docstrings here
    """
    _dynamo.reset()
# in _dynamo/
No changes
```
Consensus so far seems to be proposal 2, since fewer warnings will be less jarring and it'll make it quite easy to merge the public API.
## Docstrings
The above was an example of a function that has no inputs or outputs, but there are other functions which could use an improvement in their docstrings. For example, `allow_in_graph` actually works over lists of functions, but that's not mentioned anywhere in the example, only if you read the source code.
```python
def allow_in_graph(fn):
    """
    Customize which functions TorchDynamo will include in the generated
    graph. Similar to `torch.fx.wrap()`.

    Parameters:
        fn (callable or list/tuple): The function(s) to be allowed in the graph.

    Returns:
        callable or list/tuple: The input function(s) included in the graph.

    Examples:
        Customize inclusion of a single function:
        ::
            torch._dynamo.allow_in_graph(my_custom_function)

        Customize inclusion of multiple functions:
        ::
            torch._dynamo.allow_in_graph([my_custom_function1, my_custom_function2])

            @torch._dynamo.optimize(...)
            def fn(a):
                x = torch.add(a, 1)
                x = my_custom_function(x)
                x = torch.add(x, 1)
                return x

            fn(...)

    Notes:
        The `allow_in_graph` function allows customization of which functions TorchDynamo
        includes in the generated graph. It can be used to include specific functions that
        are not automatically captured by TorchDynamo.

        If `fn` is a list or tuple, `allow_in_graph` will be called recursively on each
        element in the sequence.

        Once a function is allowed in the graph using `allow_in_graph`, it will be captured
        in the graph generated by TorchDynamo. This customization enables more fine-grained
        control over the functions included in the graph.

        Note that `allow_in_graph` expects the input `fn` to be a callable.
    """
    if isinstance(fn, (list, tuple)):
        return [allow_in_graph(x) for x in fn]
    assert callable(fn), "allow_in_graph expects a callable"
    allowed_functions._allowed_function_ids.add(id(fn))
    allowed_functions._disallowed_function_ids.remove(id(fn))
    return fn
```
So to make the API public, we’d have to write similar docstrings for all public functions we’d like to create.
The benefits of this approach are:
1. No BC risks: internal and external users relying on our tooling can slowly wean off the private functions.
2. We will also have to write correct docstrings, which will automatically make our documentation easier to maintain and render correctly on pytorch.org
3. We already have some BC guarantees: we don't kill OptimizedModule, and we rejected the PR to change the config system
The con of this approach is that we will be stuck with some potentially suboptimal functions/classes that we can't kill
## Testing strategy
If the approach is mostly to make a public function call an already-tested private function, then all we need to do is ensure that the function signatures don't change.
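A hedged sketch of what such a check could look like, assuming the proposed `torch.compiler` namespace exists and simply forwards to `torch._dynamo`:
```python
import inspect
import torch

def test_public_wrappers_match_private_signatures():
    pairs = [
        (torch.compiler.reset, torch._dynamo.reset),
        (torch.compiler.allow_in_graph, torch._dynamo.allow_in_graph),
    ]
    for public_fn, private_fn in pairs:
        # The public wrapper should keep the exact signature of the tested private function.
        assert inspect.signature(public_fn) == inspect.signature(private_fn)
```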
## Which functions should be in the public API
Our heuristic for deciding whether something should be public is: are users already relying on it for lack of other options, or have we recommended some non-public functions for users to debug their PT 2.0 programs?
The heuristic for not putting something in the public API is that it's an experimental subsystem with the goal of turning it on by default, it's very core-dev or Meta centric, it's a bunch of different configs that should be batched into a single user-facing one, or it's something that needs to be renamed because the name is confusing
#### Top level
`torch.compile()` -> already is a public API; it does require some minor improvements like having configs be passed in to any backend and not just inductor (EDIT: this was already done in https://github.com/pytorch/pytorch/pull/99645) and renaming `mode=reduce-overhead` to `mode=cudagraph`
To make sure that PT 2.0 is supported with a given pytorch version, users can create a new public function, and this would replace the need for the `try/except` blocks around `import torch._dynamo` that have been populating user code.
```python
def pt2_enabled():
if hasattr(torch, 'compile'):
return True
else:
return False
```
For all of the below they will be translated to `torch.compiler.function_name()`
#### From _dynamo
As a starting point we looked at https://github.com/pytorch/pytorch/blob/main/torch/_dynamo/__init__.py and we suggest redefining these functions in `pytorch/torch/compiler/__init__.py`
It might also make sense to split them over multiple files and import them in `__init__.py`, but because the number of functions is small it'd probably be fine to add them all into a single `compiler/__init__.py` until this list becomes larger
1. `reset()`
2. `allow_in_graph()`
3. `list_backends()`
4. `compile()`: `torch.compile()` would be mostly a shell function passing arguments to `torch.compiler.compile()`
5. `assume_constant_result()`: TODO: Double check how this is useful
6. `torch._dynamo.disable()`
Some notable omissions
1. `explain()`: We need to clean up the output for this function, make it a data class and pretty printable
2. `forbid_in_graph()`: Considered adding this but should instead consolidate on `disallow_in_graph`
3. `optimize_assert()`: Already covered by `torch.compile(fullgraph=True)`
4. `check_if_dynamo_supported()`: this would be supplanted by `pt2_enabled()`
5. `compilation_metrics`, `graph_breaks_reasons`, etc.: would all be accessed via `torch.compiler.explain()`
6. `replay`: does not seem useful to end customers
7. `graph_break()`: Mostly useful for debugging or unit tests
8. `register_backend()`: End users will just pass a string backend to torch.compile, only devs will create new backends
9. `export()`: Eventually this needs to be public, but for now it's not ready, so just highlighting that it will be in the public API eventually
10. `disallow_in_graph()`: Usage is limited
11. `mark_static()`: we can keep this private until dynamic=True is recommended in stable
12. `mark_dynamic()`: we can keep this private until dynamic=True is recommended in trunk
13. `OptimizedModule`: This is the only class that we'd expose, but it is crucial since users are running code like `if isinstance(mod, OptimizedModule): torch.save(mod._orig_mod)`. EDIT: because we fixed pickling we no longer need to expose this
14. `is_compiling()`: Still not clear how this is useful to end users
There are also config variables which we need to expose https://github.com/pytorch/pytorch/blob/main/torch/_dynamo/config.py
Some of our configs are useful dev flags, others gate experimental functionality, and others are essential debugging tools; we separate out the essential debugging and logging tools into a public-facing config.
TODO: I still need to think of a good way of porting the config in a BC way; here are some ideas:
1. Just make all passes available and controllable via `torch.compile(options={})` but only show docstrings for the ones users should care about.
The current problem with our config system is that we have 3 ways of setting configs: via `options={}`, via environment variables, and via variables in `config.py`. It'd be worth settling on one source of truth and having that be the public API.
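For concreteness, here is the same kind of knob set in the three ways described above; the `max_autotune` option and its `TORCHINDUCTOR_MAX_AUTOTUNE` environment variable are used purely as an illustrative example.
```python
import os
import torch

def fn(x):
    return torch.sin(x) + torch.cos(x)

# 1. per-call, through torch.compile options
compiled = torch.compile(fn, options={"max_autotune": True})

# 2. process-wide, through an environment variable (picked up when the config module is imported)
os.environ["TORCHINDUCTOR_MAX_AUTOTUNE"] = "1"

# 3. process-wide, by mutating the config module directly
import torch._inductor.config as inductor_config
inductor_config.max_autotune = True
```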
The configs we should make public are
1. `log_file_name`
2. `verbose`
3. `cache_size_limit`
4. `repro_level` and `repro_after`: Although we can rename these to minifier and give human readable names to the levels
Everything else should stay private, in particular:
1. `print_graph_breaks`, `print_specializations`: should be supplanted by `explain()` for public users
2. dynamic shape configs : Users should only have to worry about `torch.compile(dynamic=True/False)`
3. The distributed flags, hook or guard configs: If we tell a user to use FSDP and DDP then the flag should be enabled by default or be in a private namespace
4. The fbcode flags: Obviously no need to be user facing
5. Skip/Allow lists: Not something normal users should play around with
#### From _inductor
Very little of inductor should be exposed in a public-facing API. Our core audience, as in people writing models, mostly just needs information on what certain passes mean and how to control them at a high level, and they can do this with `torch.compile(options={})`, so the goal here should be more to make the available passes clearer and ideally consolidate them into `torch.compile()` docstrings or modes.
There are some exceptions though from https://github.com/pytorch/pytorch/blob/main/torch/_inductor/__init__.py
1. `list_mode_options()`
2. `list_options()`: this needs an additional pass to hide internal or debug options
For both of these we'd rename them to `compiler.inductor_list_mode_options()` and `compiler.inductor_list_options()` since they would be in the same `__init__.py` file as the one for dynamo
Notable omissions
1. `_inductor.compile()`: Because users are coming in with their own fx graph, they are likely developers
2. `_inductor.aot_compile()`: Again this is about capturing and modifying fx graphs, so these APIs don't need to be public
However the configs are a slightly different story, because we can choose to either
1. Make all configs public
2. Make some configs public and keep most of the private ones. If public config is set it should override the private version
3. Make all configs controllable via `torch.compile(options={})` but make list_options() hide more things
For now 3 seems like the most reasonable choice, with some high-level configs we'll keep, like `TORCH_COMPILE_DEBUG`
Regardless, here's what should probably be public or advertised more:
1. `disable_progress` and `verbose_progress`: Combine and enable by default
2. `fallback_random`: We could make the case this shouldn't be public if a top level deterministic mode enables this
3. `profile_bandwidth`: Or could make the case that this should be in TORCH_COMPILE_DEBUG
Notable omissions
1. Any config that would generally improve performance for most that we should probably enable by default but might be disabled in the short term because of stability: example `epilogue_fusion`, `pattern_matcher`, `reordering`
2. Autotuning flags: Should just sit behind `torch.compile(mode="max-autotune")` like `max_autotune`, `max_autotune_gemm`
3. `coordinate_descent_tuning`: This one I'm a bit mixed about; maybe it should just also fall into `mode="max-autotune"`
4. `trace`: `TORCH_COMPILE_DEBUG` is the best flag for all of this
5. `triton.cudagraphs`: Default should be `torch.compile(mode="reduce-overhead")` - I'd go further and rename the `mode=cudagraph` and we can keep reduce-overhead for BC reasons
6. `triton_unique_kernel_names`: Mostly useful for devs debugging
7. `dce`: which doesn't really do anything
8. `shape_padding`: Elias is working on enabling this by default in which case we also remove it
## Mechanics
This PR would include the public functions with their docstrings
Another PR will take a stab at the configs
And for work where the APIs are still being cleaned up, whether it's the minifier or escape hatches, export or dynamic shapes, aot_inductor, etc., we'll keep them private until a public commitment can be made
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102182
Approved by: https://github.com/jansel
I confirmed that there are no usages of these APIs on github code search
or internally. There may still be usages (hence the BC-breaking label),
but I expect none to very few.
There are some leftover `py_context_manager_DEPRECATED` usages that will likely
stay that way for a while because:
- they are used outside of the pytorch repo (`_AutoDispatchBelowAutograd`,
`_DisableTorchDispatch`, `_InferenceMode`)
- they are high risk (all of the torch_function / torch_dispatch related
stuff)
- PyTorch requires that the object behaves like a "Python RAII guard"
(`_DisableFuncTorch`, `_MultithreadingEnabled`)
This is probably the last PR in the context manager cleanup series.
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102643
Approved by: https://github.com/bdhirsh
`register_functional_op`:
- constructs the functional variant of an op
- registers a functionalization kernel to the op
To get this to work:
- `register_functional_op` makes assumptions that it checks about the
op's schema. In particular, the op is not allowed to return anything it
mutates. We can relax these constraints in the future.
- We add a "boxed" python functionalization kernel that handles this
case.
I'm not actually sure (or convinced) this should be public API or how
it should work. If we want this to be public, then it should probably be
a torch.library API, but does that also mean we should give the same
lifetime guarantees? If so, then it would be up to the user to construct
a Library object to actually register the functional variant onto.
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102293
Approved by: https://github.com/bdhirsh
We did this for TestCustomOp, now we are applying the same thing to
TestPythonRegistration.
This PR:
- changes TestPythonRegistration to register new ops under a single
namespace (self.test_ns)
- clean up the namespace by deleting it from torch.ops after each test
is done running.
This avoids a problem where if an op is re-defined, torch.ops.myns.op
crashes because we do some caching. The workaround in many of these
tests has been to just create an op with a different name, but this PR
makes it so that we don't need to do this.
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102292
Approved by: https://github.com/ezyang, https://github.com/bdhirsh
Summary:
Fixes:
```
warning: missing return statement at end of non-void function
```
This warning is cluttering a lot of compilation logs!
Test Plan: Sandcastle
Differential Revision: D46374554
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102785
Approved by: https://github.com/Skylion007
1. `torch.autograd.profiler` interface parameters changed (using `self.use_device` instead of `self.use_cuda` facilitates access by other devices; it will be integrated in a subsequent PR).
2. Modify `ProfilerEventStub` (aka `std::shared_ptr<CUevent_st>`) to `ProfilerVoidEventStub` (aka `std::shared_ptr<void>`) so that `ProfilerStubs` can be inherited by any `{device}Methods`.
In addition, `cuda_event_start_` is renamed to `device_event_start_`; cuda and other devices can use this event pointer if needed.
3. Custom device support using legacy profiling (adds the `ProfilerState::KINETO_PRIVATEUSE1_FALLBACK` option).
4. Add the `privateuse1Stubs` registration.
(Result parsing and test cases are added in a subsequent PR.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101554
Approved by: https://github.com/aaronenyeshi
This patch reuses `radix_sort` from fbgemm and makes `torch.(arg)sort` work in parallel for tensors filled with integers.
In GNN workloads we often use `torch.(arg)sort`, for example to calculate the permutation from CSR to CSC storage format. Until now, sorting one-dimensional data was performed sequentially. Recently, the `radix_sort` implementation from FBGEMM was moved to common utilities and was also enhanced to cover negative numbers ([pytorch/FBGEMM#1672](https://github.com/pytorch/FBGEMM/pull/1672)). This gives us an opportunity to reuse `radix_sort` to accelerate 1D integer sorting in PyTorch.
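For instance, this is the kind of 1-D integer sort that now runs in parallel: computing, from the column indices, the permutation that reorders the nonzeros of a CSR matrix into CSC order.
```python
import torch

# Column indices of a CSR matrix (1-D int64 tensor); a stable sort of these
# yields the permutation of the nonzeros in CSC order.
col_indices = torch.tensor([0, 2, 1, 0, 2])
perm = torch.argsort(col_indices, stable=True)
print(perm)  # tensor([0, 3, 2, 1, 4])
```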
Benchmark results, measured on a single socket, 56C machine:
Before (int64):
```
size: 64000, average run time (from 100 runs): 6.592ms
size: 128000, average run time (from 100 runs): 9.798ms
size: 256000, average run time (from 100 runs): 19.199ms
size: 512000, average run time (from 100 runs): 36.394ms
size: 1024000, average run time (from 100 runs): 70.371ms
size: 2048000, average run time (from 100 runs): 137.752ms
size: 4096000, average run time (from 100 runs): 287.257ms
```
After(int64):
```
size: 64000, average run time (from 100 runs): 1.553ms
size: 128000, average run time (from 100 runs): 1.853ms
size: 256000, average run time (from 100 runs): 2.873ms
size: 512000, average run time (from 100 runs): 4.323ms
size: 1024000, average run time (from 100 runs): 7.184ms
size: 2048000, average run time (from 100 runs): 14.250ms
size: 4096000, average run time (from 100 runs): 29.374ms
```
Notes:
Average speedup from measured tensor sizes is 7.7x.
For smaller types (e.g. int32/int16), even higher speedup is observed, as fewer passes are required.
Depends on #100236.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100081
Approved by: https://github.com/mingfeima, https://github.com/ngimel
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 963044b</samp>
The pull request improves the reliability and completeness of the external contribution stats collection and upload. It adds a `time` delay to avoid API rate limit errors in `upload_external_contrib_stats.py`, and changes the order and date range of the commands in `nightly-rockset-uploads.yml`.
<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 963044b</samp>
> _Oh we are the coders of the open source sea_
> _And we pull and we push with the `git` command_
> _We upload the stats of the external PRs_
> _With a ten-day range and a `time` delay_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102780
Approved by: https://github.com/kit1980
<!--
copilot:all
-->
### <samp>🤖 Generated by Copilot at 943f854</samp>
### Summary
:clock15:⬆️☁️
<!--
1. :clock15: - This emoji represents the 15-minute interval of the cron schedule, and also suggests the idea of time-based triggers or events.
2. ⬆️ - This emoji represents the upload action of the workflow, and also suggests the idea of moving data from one place to another.
3. ☁️ - This emoji represents the AWS/Rockset destination of the alerts, and also suggests the idea of cloud-based services or platforms.
-->
Add a new workflow to upload alerts to a database. The workflow `.github/workflows/upload_alerts.yml` runs periodically on a cron schedule and uses AWS/Rockset as the backend.
> _`workflow` file added_
> _upload alerts to the cloud_
> _every quarter hour_
### Walkthrough
* Add a new workflow to upload alerts to AWS/Rockset every 15 minutes ([link](https://github.com/pytorch/pytorch/pull/102646/files?diff=unified&w=0#diff-946b3ad914f86182b35d4b6db415ddc39393c3017ef8fdaeee2b0e866ea831d6R1-R46))
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102646
Approved by: https://github.com/huydhn
One annoyance with mark_dynamic is if you use it on a user specified
tensor input (the idea being that you want to compile a function and
have it be polymorphic in size), you will get an error if the user
ever sends you a 0/1 size input, because of course we are probably
going to specialize it. So I relax the constraint even more: even if we
find it's constant, if the value is 0/1, that's no big deal.
There's some irritating code duplication that I don't entirely know how
to resolve.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102729
Approved by: https://github.com/avikchaudhuri, https://github.com/voznesenskym
It was recently reported that `ncclCommAbort` itself may hang in some NCCL versions. For example, https://github.com/NVIDIA/nccl/issues/829.
In that case, it may be desirable to directly tear down the program without properly aborting the NCCL communicator, so that user does not wait for hours before noticing a hang.
This PR adds a new value, 3, for the env var `NCCL_ASYNC_ERROR_HANDLING` that skips the comm abort and directly throws an error in case of an exception (timeout, async error, etc.).
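A minimal usage sketch, assuming (as with the existing values) that the variable has to be set before the NCCL process group is created:
```python
import os
import torch.distributed as dist

os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "3"  # skip ncclCommAbort, just throw and tear down
dist.init_process_group(backend="nccl")
```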
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102599
Approved by: https://github.com/fegin
To avoid nvcc segfaults, compile without `--source-in-ptx` option on CUDA-12.1+
<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 984e4b2</samp>
> _Sing, O Muse, of the daring deeds of PyTorch, the swift and fiery_
> _framework that harnesses the power of CUDA, the blazing tool of Nvidia._
> _How they faced a mighty challenge when CUDA, the ever-shifting,_
> _released a new version, twelve point one, that broke their code and caused them grief._
Fixes https://github.com/pytorch/pytorch/issues/102372
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102756
Approved by: https://github.com/atalman
Sharding on ROCm is broken; I can't replicate it on dummy PRs even though it seems to happen pretty often on main, so I'm adding this to increase my sample size. Hopefully this is enough print statements...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102713
Approved by: https://github.com/huydhn
fixes #101911
Currently, `DTensor` supports cuda and cpu. This PR makes some changes for easier integration with the ort backend.
* `Backend.NAME` attribute now has value `name` instead of `NAME` for backends registered through `register_backend(name)`; this matches the pattern for backends with built-in support like nccl.
* remove unused `_check_for_nccl_backend` function
* add test case that moves parameters to device in the `partition_fn` - a scenario that's useful for big models
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101914
Approved by: https://github.com/wanchaol
Summary:
`getStreamFromPool(bool, signed char)` overload doesn't initialize `max_stream_priorities`. So if we call `getStreamFromPool(true)` we would hit the following error
```
terminate called after throwing an instance of 'c10::Error'
what(): Expected cuda stream priority to be less than or equal to 0, got 1
```
Differential Revision: D46358087
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102739
Approved by: https://github.com/ngimel
Fixes#101449
I found it better to either imitate the combo of `TensorIterator::can_use_32bit_indexing` and `TensorIterator::with_32bit_indexing` or adroitly choose the index type depending on `Tensor::numel` in the future.
---
Used `nsys nvprof` to casually see the effect of `int64_t` indexing:
```python
import torch
params = [
    {"params": [torch.randn(32, 32, device="cuda") for _ in range(100)]},
    {"params": [torch.randn(32, 32, device="cuda") for _ in range(100)]},
]
grads = [
    [torch.randn(32, 32, device="cuda") for _ in range(100)],
    [torch.randn(32, 32, device="cuda") for _ in range(100)],
]
optimizer = torch.optim.Adam(params, fused=True)
for _ in range(100):
    for i, param_groups in enumerate(params):
        for p, g in zip(param_groups["params"], grads[i]):
            p.grad = g
    optimizer.step()
    optimizer.zero_grad()
```
Environment
```
Collecting environment information...
PyTorch version: 2.1.0a0+gitf994d0b
Is debug build: False
CUDA used to build PyTorch: 12.1
Python version: 3.10.9 (main, May 17 2023, 00:46:40) [GCC 11.3.0] (64-bit runtime)
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-80GB
```
---
- `multi_tensor_apply_kernel<at::native::<unnamed>::FusedOptimizerTensor` -> 1.02x
- `multi_tensor_apply_kernel<at::native::<unnamed>::TensorListMetadata<(in…` -> 1.04x
Current main branch:
```
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- -------- -------- -------- -------- ----------- ----------------------------------------------------------------------------------------------------
64.9 5787610 600 9646.0 9632.0 9503 9888 52.9 void at::native::<unnamed>::multi_tensor_apply_kernel<at::native::<unnamed>::FusedOptimizerTensorLi…
...
8.1 720575 200 3602.9 3584.0 3551 4320 63.4 void at::native::<unnamed>::multi_tensor_apply_kernel<at::native::<unnamed>::TensorListMetadata<(in…
```
this PR:
```
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- -------- -------- -------- -------- ----------- ----------------------------------------------------------------------------------------------------
65.0 5876847 600 9794.7 9792.0 9632 10080 58.1 void at::native::<unnamed>::multi_tensor_apply_kernel<at::native::<unnamed>::FusedOptimizerTensorLi…
...
8.3 748313 200 3741.6 3744.0 3711 4479 60.0 void at::native::<unnamed>::multi_tensor_apply_kernel<at::native::<unnamed>::TensorListMetadata<(in…
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101760
Approved by: https://github.com/ngimel
Do not try to parse raised exception for no good reason
Add short description
Reduce script to a single line
<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at ea4164e</samp>
> _`test_no_triton_on_import`_
> _Cleans up the code, adds docs_
> _No hidden errors_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102674
Approved by: https://github.com/cpuhrsch, https://github.com/albanD
Summary: Sometimes, squeeze can be a "call_method" instead of a "call_function". Normalizing it will make it amenable to pattern matching by passes like "split->squeeze"
Test Plan: * CI tests
Differential Revision: D46031846
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102294
Approved by: https://github.com/jansel
Summary:
We are currently silently skipping all PT2 quantization
tests due to a recent typo. This commit fixes this and also adds
warnings so it'll be easier to debug similar issues in the future.
Test Plan: python test/test_quantization.py
Differential Revision: D46329480
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102644
Approved by: https://github.com/jerryzh168
Removes the outdated HIP flags appended to HIP_CXX_FLAGS
This will help remove the following warnings in the pytorch build log
```
[6238/6889] Building CXX object caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/cudnn/hip/Conv_v8.cpp.o
cc1plus: warning: command line option ‘-Wno-duplicate-decl-specifier’ is valid for C/ObjC but not for C++
cc1plus: warning: unrecognized command line option ‘-Wno-unused-command-line-argument’
cc1plus: warning: unrecognized command line option ‘-Wno-exceptions’
cc1plus: warning: unrecognized command line option ‘-Wno-inconsistent-missing-override’
cc1plus: warning: unrecognized command line option ‘-Wno-macro-redefined’
```
This also updates the gloo submodule commit to include the similar change made to gloo.
597accfd79
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102271
Approved by: https://github.com/malfet
A few bits of weirdness needed to happen here:
- skipIfRocm doesn't work as a unittest class decorator; it returns a function,
and the test discovery logic looks for things that inherit from TestCase. So
I wrapped the individual test methods instead.
- Inside fbcode, our test runner (buck + tpx) discovers and runs tests using
two separate processes, so it's important to use @wraps on the generated
class to make it "look like" a regular test.
Differential Revision: [D46344980](https://our.internmc.facebook.com/intern/diff/D46344980/)
**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D46344980/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102695
Approved by: https://github.com/zou3519
Currently reduction bodies are duplicated in several different places.
This reduces the duplication by extracting the `combine_fn` definition used in
`_unroll_reduction_fn` and using it in the triton codegen. For cpp
this also makes better use of `reduction_combine{,_vec}` by using them
to generate the `omp declare reduction` line and the `vec_reduce_all`
call.
For triton the only change is that the combine step gets spread
over two lines, e.g. instead of:
```python
_tmp1 = tl.where(rmask & xmask, triton_helpers.maximum(_tmp1, tmp0), _tmp1)
```
we get
```python
tmp2 = triton_helpers.maximum(_tmp1, tmp0)
_tmp1 = tl.where(rmask & xmask, tmp2, _tmp1)
```
For cpp the only change is that inplace reduction operations are now written as
an out-of-place operation and an assignment, e.g. instead of
```cpp
omp_out += omp_in
```
we generate
```cpp
omp_out = omp_out + omp_in
```
This is a purely cosmetic change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99661
Approved by: https://github.com/lezcano, https://github.com/ngimel
Summary:
Make all quantization specs inherit from the same base class in order to simplify the typing
for QuantizationAnnotation
Test Plan:
```
buck2 test mode/opt caffe2/test:quantization_pt2e -- 'caffe2/test:quantization_pt2e'
```
Reviewed By: kimishpatel
Differential Revision: D46173954
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102582
Approved by: https://github.com/andrewor14
Accumulate data in a local buffer prior to sending it. This reduces
the number of syscalls and network packets.
We flush every 1440 bytes to cap the amount of temporary memory.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100742
Approved by: https://github.com/fduwjj
There are some I can't easily switch due to reasons like:
- Dynamo modelling the guard
- BC concerns (for torch.autograd.set_multithreading_enabled)
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102642
Approved by: https://github.com/albanD
The previous timeout log did not print size info, making it hard to debug hangs caused by message size mismatches.
(The reason is that when copying the `WorkNCCL` object during work enqueue, we don't copy `outputs_` due to reference concerns, hence `output.size()` is never triggered.)
This PR logs sizes using separate fields, hence not relying on `outputs_`.
New timeout log:
```
[Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=_ALLGATHER_BASE, NumelIn=209715200, NumelOut=1677721600, Timeout(ms)=10000) ran for 10957 milliseconds before timing out.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100413
Approved by: https://github.com/kumpera
Updating the pin to the same hash as https://github.com/pytorch/pytorch/pull/100922
On the XLA side, the build has switched from CMake to bazel, which requires a number of changes on the PyTorch side:
- Copy installed headers back to the `torch/` folder before starting the build
- Install `torch/csrc/lazy/python/python_utils.h`
- Define `LD_LIBRARY_PATH`
TODO:
- Enable bazel caching
- Pass CXX11_ABI flag to `//test/cpp:all` to reuse build artifacts from `//:_XLAC.so`
<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at cd4768b</samp>
> _To fix the XLA tests that were failing_
> _We updated the submodule and scaling_
> _We added `python_util.h`_
> _And copied `torch` as well_
> _And set `LD_LIBRARY_PATH` for linking_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102446
Approved by: https://github.com/huydhn
Summary:
Currently if you are inserting into JIT IR at the same point in the middle of the graph,
it only allows for 40 inserts before it has to reindex. Reindexing is N**2 behavior, which can
lead to slow load times. This changes it so that it keeps track of how many insertions happen
at a single point (like when a function is being inlined) to predict how many future insertions will happen
there. It then adjusts how it assigns topology to make sure there is enough room for those predicted insertions.
In practice this will allow around 2M inserts at a single point before it reindexes.
Test Plan: test_jit.py
Differential Revision: [D46206617](https://our.internmc.facebook.com/intern/diff/D46206617)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102312
Approved by: https://github.com/eellison
Summary:
aten::uniform implementation.
The randomization function doesn't use Perlin, as the resulting distribution would not be uniform;
we chose to use PCG (https://www.reedbeta.com/blog/hash-functions-for-gpu-rendering/) instead.
Test Plan:
```
yipjustin@yipjustin-mac fbsource % buck run -c pt.vulkan_full_precision=1 --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -- --gtest_filter="*uniform*"
Downloaded 0/47 artifacts, 0.00 bytes, 100.0% cache miss (for updated rules)
Building: finished in 40.0 sec (100%) 524/524 jobs, 10/524 updated
Total time: 40.0 sec
BUILD SUCCEEDED
Running main() from xplat/third-party/gmock/googletest-1.12.1/googletest/src/gtest_main.cc
Note: Google Test filter = *uniform*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from VulkanAPITest
[ RUN ] VulkanAPITest.uniform
[ OK ] VulkanAPITest.uniform (54 ms)
[----------] 1 test from VulkanAPITest (54 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (54 ms total)
[ PASSED ] 1 test.
```
Differential Revision: D46170098
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102431
Approved by: https://github.com/SS-JIA
Copies over bits of the script from test-infra to grab the relevant parts of an alert and turn them into a json. Generally copied over from check_alerts in pytorch/test-infra
<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 1789c36</samp>
> _`Python 3` shebang_
> _added for compatibility_
> _a good practice / spring_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102002
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
Add `is_backend_available` for c10d backends, covering both the built-in backends and third-party backends registered through `Backend.register_backend`.
There is a related discussion in https://github.com/pytorch/pytorch/pull/101775#discussion_r1199253553
> For example in python constructor for their backend they should explicitly add the is_X_available. Or if defining in C++ they should modify pybind like this https://github.com/H-Huang/torch_collective_extension/blob/main/custom_backend/include/dummy.hpp#L98-L101
to also add their own is_available property
It is a natural choice for users to add their own `is_available` when they create a backend. We think it might be a possible way for the user to use `is_X_available` in the same way as the native ones, for example by dynamically adding a `torch.distributed.is_dummy_available()` function. This is why we want to dynamically add the `is_X_available` to `torch.distributed` in `register_backend`.
> Or we could add an Is_available(backend) function, that checks for the backend.
Providing a public function is indeed another good approach. We have implemented an `is_backend_available` in https://github.com/pytorch/pytorch/pull/101945 that supports both built-in backends and third-party backends.
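A small usage sketch of that function, assuming it is exposed as `torch.distributed.is_backend_available`:
```python
import torch.distributed as dist

print(dist.is_backend_available("gloo"))   # built-in backend; True on a standard build
print(dist.is_backend_available("dummy"))  # third-party backend; False unless it has been registered
```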
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101945
Approved by: https://github.com/H-Huang
- Add get_printoptions and printoptions context manager
- Improve edgeitems handling when it is zero
- Add render_call which can be used to conveniently print command
line arguments of a function call, while suppressing actual
tensor data
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102623
Approved by: https://github.com/albanD
This PR adds `aot_export_module` as the lowering path from a torch-level graph to an aten graph. Some known limitations that need to be addressed in the follow-up PRs:
1. Store param/buffer data in ExportedProgram
2. Fully support torch.cond with params/buffers
3. Making sure no duplicated ExportMetaData entry
4. This API will break Executorch if used on PyE, we will figure out a plan internally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101490
Approved by: https://github.com/avikchaudhuri
There is a bug in the test workflow where it could fail to find the new Docker image when the image hasn't yet became available on ECR, for example e71ab21422. This basically is a race condition where the test job starts before the docker-build workflow could finish successfully. The fix here is to make sure that the test job has the opportunity to build the image if it doesn't exist, same as what the build workflow does atm. Once the docker-build workflow finishes pushing the new image to ECR, that can then be used instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102562
Approved by: https://github.com/PaliC
Summary: In cases where DDP backward is not finalized, the error is raised only in the next forward iteration of DDP. However, if there are other collective calls between those two points, training scripts could potentially get stuck.
As a result, there should be a way to check if DDP finalized after calling `.backward()`. To address this, I've added a `_check_reducer_finalized` method to validate that DDP indeed did successfully finish reduction.
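A small usage sketch of the check described above; it assumes the default process group is already initialized and that the method raises if the last backward did not finalize.
```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

model = DDP(torch.nn.Linear(8, 8))  # assumes dist.init_process_group(...) was called earlier
out = model(torch.randn(4, 8))
out.sum().backward()
model._check_reducer_finalized()    # surface an unfinished reduction here, not at the next forward
```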
Test Plan: Added unit tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100773
Approved by: https://github.com/rohan-varma
This PR adds a `py_context_manager_DEPRECATED` that converts a C++ RAII
guard to an object that may be either used as Python context manager or
as a "Python RAII guard".
We don't convert all of them to Python context manager only due to BC
reasons; people in OSS and internally actually rely on these APIs and I
don't want to break them. We are justified in breaking BC if we wanted
to, but it seemed like too much work for not a lot of gain.
The API is postfixed with "DEPRECATED" to indicate that people should
really use `py_context_manager` (converts C++ RAII guard to Python
context manager) instead.
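For illustration, the two usage styles look roughly like this, using `_AutoDispatchBelowAutograd`; the binding path `torch._C._AutoDispatchBelowAutograd` and its context-manager support are assumptions in this sketch.
```python
import torch

# Python context-manager style (what PyTorch's own call sites are moving to):
with torch._C._AutoDispatchBelowAutograd():
    ...  # code that should run below the Autograd dispatch key

# "Python RAII guard" style (what some existing out-of-tree callers rely on):
guard = torch._C._AutoDispatchBelowAutograd()
...  # guarded code
del guard
```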
Test Plan:
- this PR converts all PyTorch usages of _AutoDispatchBelowAutograd to
context manager. I can do the rest in follow-ups.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102579
Approved by: https://github.com/bdhirsh, https://github.com/albanD
Fixes https://github.com/pytorch/pytorch/issues/101960
When I trace a function that runs an out-operator with more than one output, I get an error. This is because the case where the out operator has more than one output is not handled.
```python
def test_trace_out_operator_with_two_output():
    example_input = torch.rand(2, 8)
    out_1, out_2 = torch.cummax(example_input, 1)

    def run_cummax(example_input, out_1, out_2):
        output_1, output_2 = torch.cummax(example_input, 1, out=(out_1, out_2))
        return output_1, output_2

    trace_model = torch.jit.trace(run_cummax, (example_input, out_1, out_2))
```
and the error info:
```
    raise TracingCheckError(
torch.jit._trace.TracingCheckError: Tracing failed sanity checks!
encountered an exception while running the trace with test inputs
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101563
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/davidberard98
Failing mechanism on #95424:
In dynamo mode, when passing a `numpy.int_` to a 'shape'-like param (`Sequence[Union[int, SymInt]]`), it is wrapped as a list containing a FakeTensor. However, in python_arg_parser, the parser expects an int in the symint_list but got a FakeTensor.
Following #85759, this PR allows tensor elements in symint_list when in dynamo mode.
This PR also fixes the below tests, which have a similar failing mechanism:
pytest ./generated/test_huggingface_diffusers.py -k test_016
pytest ./generated/test_ustcml_RecStudio.py -k test_036
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97508
Approved by: https://github.com/yanboliang
This attribute wasn't actually used in tests, add a test ensuring that
if replicate is used on top of FSDP, the replicated parameter names are as
expected.
TODO: there are a few ways to check if module is managed by composable API,
such as replicated param names for replicate, _get_module_state API,
_get_registry_api, etc. We should unify all composable APIs to check in a
unified way (filed an issue)
Differential Revision: [D46236377](https://our.internmc.facebook.com/intern/diff/D46236377/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102401
Approved by: https://github.com/awgu
As slow gradcheck is slow (Thank you, Captain Obvious!), let's run it on the newer G5 runner to improve its TTS and avoid flaky timing out error such as https://github.com/pytorch/pytorch/actions/runs/5112059782/jobs/9190167924. AFAIK, there is no reason to keep running slow gradcheck on `linux.4xlarge.nvidia.gpu`
### Testing
* `1st` shard: `3h30m` → `4h`. The increase is probably due to https://github.com/pytorch/pytorch/pull/102380, in which the job's name switched from `gcc7` to `gcc9`. Does this invalidate the test time used to balance these shards?
* `2nd` shard: `4h35m` → `4h15m`
* `3rd` shard: `3h20m` → `1h20m`
* `4th` shard: `3h20m` → `2h10m`
* `14h45m` → `11h45m`, a total saving of `3h`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102496
Approved by: https://github.com/malfet
FakeTensor doesn't normalize device_idx and failed with the below test case:
```python
import torch
import habana_frameworks.torch.hpu
from torch._subclasses.fake_tensor import FakeTensorMode

with FakeTensorMode.push():
    a = torch.empty(1, device="hpu")
    b = torch.empty(1, device="hpu:0")
    result = a + b
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102512
Approved by: https://github.com/albanD
Enables the hipSolver backend for ROCm builds
--------------------------------------------------------------------------
- Minimum ROCm version requirement - 5.3
- Introduces a new macro USE_LINALG_SOLVER that controls enablement of both cuSOLVER and hipSOLVER
- Adds hipSOLVER API to hipification process
- combines hipSOLVER and hipSPARSE mappings into a single SPECIAL map that takes priority over normal mappings
- Torch api to be moved to hipsolver backend (as opposed to magma) include: torch.svd(), torch.geqrf(), torch.orgqr(), torch.ormqr()
- Will enable 100+ linalg unit tests for ROCm
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97370
Approved by: https://github.com/malfet
`repeat_interleave_symint` is currently implemented by guarding on the `SymInt`
and converting it to a tensor to pass to the Tensor overload. This instead
implements it as a copy of an expanded tensor, which can be done without guards
and is also much more efficient in eager mode to boot.
For example, these are timings for `x.repeat_interleave(100, dim=-1)` with `x.shape == (1000, 100)`
| Device | Time (Master) | Time (This PR) | Speedup |
|--------|---------------|-----------------|---------|
| cpu | 18.8 ms | 3.5 ms | 5.4 |
| cuda | 271 us | 134 us | 2.0 |
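For intuition, a small sketch of the equivalence being exploited: repeating along the last dim is an expand over a new trailing dim followed by a reshape (the reshape is what materializes the single copy).
```python
import torch

x = torch.arange(6).reshape(2, 3)
r = 4
via_expand = x.unsqueeze(-1).expand(*x.shape, r).reshape(*x.shape[:-1], x.shape[-1] * r)
assert torch.equal(via_expand, x.repeat_interleave(r, dim=-1))
```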
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102570
Approved by: https://github.com/lezcano
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 08f7a6a</samp>
This pull request adds support for triton kernels in `torch` and `torch/cuda`, and refactors and tests the existing triton kernel for BSR matrix multiplication. It also adds a test case to ensure that importing `torch` does not implicitly import `triton`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98403
Approved by: https://github.com/malfet, https://github.com/cpuhrsch
keys and change codegen to take ETKernelIndex
We are adding support for dtype and dim order specialized kernel registration. This requires us to reorganize `BackendIndex` (which is a `Dict[DispatchKey, Dict[OperatorName, BackendMetadata]]`) to be `Dict[OperatorName, Dict[ETKernelKey, BackendMetadata]]`. This PR adds new data structures in order to support this change:
* `ETKernelKey` to retrieve a certain kernel from the registry.
* `ETKernelIndex`, the dictionary from operator name to kernel key to kernel mapping.
Note that the codegen logic is not changed yet, we need subsequent diffs to actually generate code for different kernel keys.
Differential Revision: [D46206339](https://our.internmc.facebook.com/intern/diff/D46206339/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102565
Approved by: https://github.com/Jack-Khuu
Now that we have full C++17 support, we can use if constexpr in some identified cases.
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at df4c16d</samp>
The pull request improves the performance, readability, and consistency of various function templates in the `ATen` and `torch` modules by using `constexpr` keywords and C++17 features. It also fixes some type conversion and overflow issues for different input and output types. The changes affect the code for distributions, BLAS, batch normalization, embedding bag, random number generation, vectorized operations, cuBLAS, XNNPACK, CUTLASS, and shape inference. The affected files include `DistributionsHelper.h`, `vec256_int.h`, `vec512_int.h`, `BlasKernel.cpp`, `IndexKernel.cpp`, `EmbeddingBag.cpp`, `Normalization.cpp`, `rng_test.h`, `vec_test_all_types.h`, `TransformationHelper.h`, `CUDABlas.cpp`, `DistributionKernels.cpp`, `DistributionTemplates.h`, `RangeFactories.cu`, `RangeFactories.cpp`, `qconv.cpp`, `StructuredSparseLinearCUTLASS.cu`, `vec_test_all_types.cpp`, and `shape_inference.cpp`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102471
Approved by: https://github.com/Skylion007, https://github.com/malfet
Summary:
Fixes #102416: [WARNING] couldn't find split args
In case the `dim=` kwarg is absent, we can default it to 0. Even after this, it's probably okay to make this an INFO rather than a WARNING
Test Plan: run torchbench
Differential Revision: D46292754
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102561
Approved by: https://github.com/jansel
This commit reduces the exporter memory usage by as much as 50%. During the shape inference step, the exporter caches the values of intermediate tensors in a `ConstantValueMap`. This can use as much memory as the model itself, or even more. For example, model weight tensors are often fed to a Transpose layer, and the output of that is the same size of the weights. This commit fixes the issue by removing the intermediate tensor values after they are used by all consumers.
The cached values are only used for shape inference, so removing them after use should be safe. `ConstantValueMap` is cleared anyways once shape inference is complete for the entire graph.
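Conceptually the fix is plain reference counting on the cached values. A hedged Python sketch of the idea follows; the names (`consumers_of`, `node.inputs`, `node.output`, `node.evaluate`) are hypothetical, and the real logic lives in the C++ `ConstantValueMap`.
```python
def run_shape_inference(nodes, consumers_of):
    cache = {}  # intermediate value name -> computed tensor
    remaining = {name: len(users) for name, users in consumers_of.items()}
    for node in nodes:
        cache[node.output] = node.evaluate(cache)
        for name in node.inputs:
            remaining[name] -= 1
            if remaining[name] == 0:  # last consumer processed: free the cached tensor
                cache.pop(name, None)
    return cache
```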
As an example, here is the model from issue #61263:
```python
import torch
import math
# Size in GB
tensor_size = 1
model_size = 8
layers_num = model_size // tensor_size
kB = 1024
MB = kB * kB
GB = MB * kB
precision_size = 4 # bytes per float
activation_size = math.floor(math.sqrt(tensor_size * GB / precision_size))
class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        for i in range(layers_num):
            name = "fc_%d" % i
            linear = torch.nn.Linear(activation_size, activation_size)
            setattr(self, name, linear)

    def forward(self, x):
        for i in range(layers_num):
            name = "fc_%d" % i
            linear = getattr(self, name)
            x = linear(x)
        return x

model = Net().cuda()
input = torch.zeros(activation_size, requires_grad=True).cuda()
with torch.no_grad():
    torch.onnx.export(model, (input, ), './model_large.onnx', do_constant_folding=False, opset_version=13)
```
It is just some large linear layers stacked together. Before this commit, my max GPU usage during export was about 16.7 GB, twice the model size. With this commit in combination with #101134, it was only about 9.5 GB.
Together with #101134, fixes issue #61263
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101148
Approved by: https://github.com/BowenBao
**TL;DR:** This re-introduces links between backward kernels and their corresponding forward kernels.
<img width="1020" alt="Screenshot 2023-05-26 at 7 25 22 PM" src="https://github.com/pytorch/pytorch/assets/5067123/02571b59-859c-4c9e-b3ef-121ef3159812">
In the example above, you can see there are two such flows - one for aten::add, and one for aten::binary_cross_entropy
### Details
Forward/backward links were added in https://github.com/pytorch/pytorch/pull/62553, but then disabled in https://github.com/pytorch/pytorch/pull/72904 due to segfaults (e.g. https://github.com/pytorch/pytorch/issues/69443).
Between now and when the fwd-bwd links were disabled, there's been a lot of refactoring; so this PR updates the implementation:
* Use a raw profiler::impl::Result instead of a KinetoEvent
* Move the implementation to collection.cpp, where the TraceWrapper is currently handled.
* Sort the events before processing, because they aren't always in chronological order
* There can now be more than one event in the backward pass that matches the sequenceNr-threadID pair. The implementation needed to be updated to avoid showing multiple endpoints for a given sequenceNr-threadID pair ([ptr to where the bwd sequenceNr-threadID pair is duplicated](6e3e3dd477/torch/csrc/profiler/collection.cpp (L398-L399))).
Next, we need to verify that https://github.com/pytorch/pytorch/issues/69443 is fixed. Running the repro no longer errors. Looking further into the details of the issue it seems like the handling of the [raw linkedActivity pointer (old code from 2021)](6089dcac48/libkineto/src/output_json.cpp (L283)) resulted in the segfault. Now, it doesn't look like the linked activity is used anywhere in output_json.cpp so the issue should be fixed.
### Testing
#### 1. unit test
`test_profiler_fwd_bwd_link` was un-skipped. It was modified to match the new implementation.
#### 2. https://github.com/pytorch/pytorch/issues/69443
I ran the repro in https://github.com/pytorch/pytorch/issues/69443 and verified there were no segfaults.
#### 3. Duplicate flow IDs
When forward-backward connections were first introduced, gpu-cpu async links had not been introduced. There's a possibility that gpu-cpu links and fwd-bwd links could interfere if their IDs overlap.
I manually tested this in chrome://tracing; I edited a file so that a gpu-cpu link had the same ID as one of the fwd-bwd connections. The chrome tracing UI continued showing both types of links.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102424
Approved by: https://github.com/aaronenyeshi
This PR fixes the following compilation error due to an unexpected conflict between #99057 and #101000
```
In file included from /home1/ishizaki/PyTorch/main-lastest/aten/src/ATen/cpu/vec/vec256/vec256.h:21,
from /home1/ishizaki/PyTorch/main-lastest/aten/src/ATen/cpu/vec/vec.h:6,
from /home1/ishizaki/PyTorch/main-lastest/aten/src/ATen/native/cpu/Loops.h:37,
from /home1/ishizaki/PyTorch/main-lastest/aten/src/ATen/native/cpu/batch_norm_kernel.cpp:9,
from /home1/ishizaki/PyTorch/main-lastest/build/aten/src/ATen/native/cpu/batch_norm_kernel.cpp.ZVECTOR.cpp:1:
/home1/ishizaki/PyTorch/main-lastest/aten/src/ATen/cpu/vec/vec256/zarch/vec256_zarch.h:2332:17: error: ‘at::vec::ZVECTOR::Vectorized<T> at::vec::ZVECTOR::Vectorized<T, typename std::enable_if<is_zarch_implemented_complex<T>(), void>::type>::expm1() const’ cannot be overloaded with ‘at::vec::ZVECTOR::Vectorized<T> at::vec::ZVECTOR::Vectorized<T, typename std::enable_if<is_zarch_implemented_complex<T>(), void>::type>::expm1() const’
2332 | Vectorized<T> expm1() const {
| ^~~~~
/home1/ishizaki/PyTorch/main-lastest/aten/src/ATen/cpu/vec/vec256/zarch/vec256_zarch.h:2328:17: note: previous declaration ‘at::vec::ZVECTOR::Vectorized<T> at::vec::ZVECTOR::Vectorized<T, typename std::enable_if<is_zarch_implemented_complex<T>(), void>::type>::expm1() const’
2328 | Vectorized<T> expm1() const {
| ^~~~~
cc1plus: note: unrecognized command-line option ‘-Wno-aligned-allocation-unavailable’ may have been intended to silence earlier diagnostics
cc1plus: note: unrecognized command-line option ‘-Wno-unused-private-field’ may have been intended to silence earlier diagnostics
cc1plus: note: unrecognized command-line option ‘-Wno-invalid-partial-specialization’ may have been intended to silence earlier diagnostics
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101923
Approved by: https://github.com/malfet
This PR ensures that the subgraphs use the newly created placeholders for the primary inputs and free variables. Earlier, this was not happening, and graph.lint() was failing. I need `graph.lint()` in the follow-up PRs, where I run an `Interpreter` on the subgraph to preserve the metadata information for AOT Autograd.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102448
Approved by: https://github.com/zou3519
This commit partially fixes an issue where the ONNX exporter always requires about 2x more memory than the model size. The `ONNXTracedModule` class uses a copy of the original weights only when `return_inputs=True`, so this commit makes sure the weights are cloned only in that case.
As a side note, I don't think the exporter is ever called with `return_inputs=True`, so maybe this is just some old code that can be removed.
Partially fixes #61263. There are still other places in the exporter which use more memory than they need to. For example, during the shape inference step many intermediate tensors are computed and saved until shape inference on the model is complete. I am working on a fix for that, but that optimization is independent of this one and can be done in a separate PR.
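For reference, a minimal sketch of the pattern this change follows (a hypothetical helper, not the exporter's actual code): the weight copy is created only in the branch that needs it.
```python
import torch

# Hypothetical illustration of the change's pattern, not code from torch.onnx:
# clone module state only when `return_inputs=True`, otherwise reuse it as-is.
def maybe_clone_state(module: torch.nn.Module, return_inputs: bool):
    if return_inputs:
        # the copy (and the ~2x memory) is only paid in this branch
        return {name: tensor.clone() for name, tensor in module.state_dict().items()}
    return None
```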
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101134
Approved by: https://github.com/BowenBao, https://github.com/osalpekar
Summary:
This refactor introduces an internal function which selectively tests against FX
quant as well. Notably, this does increase test times, so we need to figure out
how to resolve that.
Test Plan: test_quantization_pt2e
Reviewed By: jerryzh168
Differential Revision: D46154323
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102497
Approved by: https://github.com/jerryzh168
Currently, file-level reruns + stepcurrent are incompatible, and this is making PRs green when they are actually red, so turn off stepcurrent + file-level reruns when keep-going is used until I figure out a better way to do this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102569
Approved by: https://github.com/huydhn
# Summary
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 293ded1</samp>
This pull request adds support for using Visual Studio Code Remote - Containers extension with the pytorch project. It adds a `.devcontainer` folder with a `devcontainer.json` file, a `Dockerfile`, and a `noop.txt` file that configure and create a dev container with Anaconda and Python 3.
<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at d6b9cd7</samp>
> _`devcontainer.json`_
> _Configures PyTorch containers_
> _For CPU or GPU_
## Related to:
https://github.com/pytorch/pytorch/issues/92838
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98252
Approved by: https://github.com/ZainRizvi
Summary: Move static checks of layers[0] (e.g., the isinstance check) to model build time because isinstance() does not work in torchscripted code. Since the validation is now performed while constructing the object, the isinstance() call runs in eager mode at model build time, and we avoid needing to call isinstance() at runtime to determine whether the layers in a model are instances of the TransformerEncoderLayer class or its derived classes.
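A minimal sketch of the pattern described above (a hypothetical module, not the actual nn.TransformerEncoder code): the isinstance() validation runs eagerly in __init__, so no isinstance() call is needed in the (potentially scripted) forward.
```python
import copy
import torch.nn as nn

class Encoder(nn.Module):
    """Hypothetical example: validate layer types once at model build time."""

    def __init__(self, layer: nn.Module, num_layers: int):
        super().__init__()
        # eager-mode static check at construction; works even if forward() is later torchscripted
        self.use_fast_path = isinstance(layer, nn.TransformerEncoderLayer)
        self.layers = nn.ModuleList(copy.deepcopy(layer) for _ in range(num_layers))

    def forward(self, x):
        # no isinstance() at runtime; rely on the flag computed in __init__
        for layer in self.layers:
            x = layer(x)
        return x
```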
Test Plan: sandcastle, github
Differential Revision: D46096222
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102045
Approved by: https://github.com/mikaylagawarecki
This PR switches DeviceMesh to use a dispatchable process group instead.
This enables easier backend integration, as users only need to
integrate a custom backend with the c10d process group, without needing to
change DeviceMesh to plug in the backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102336
Approved by: https://github.com/fduwjj
Done in this PR:
- Use `torch.linalg.vector_norm` instead of `torch.norm` (see the sketch below)
- Reduce the bandwidth bound of clip_grad_norm when used with `inf`, i.e. there is no need to fetch the returned tensor after `abs`
What I'm slightly unsure about:
- I don't know if `inf` is supported by the `torch._foreach` API
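As a reference, a minimal sketch (assuming standard `torch.linalg.vector_norm` semantics, not the actual `clip_grad_norm_` implementation) of computing a global gradient norm, including the `inf` case:
```python
import torch

def global_grad_norm(parameters, ord: float = 2.0) -> torch.Tensor:
    """Hypothetical helper: global norm of all gradients via torch.linalg.vector_norm."""
    grads = [p.grad for p in parameters if p.grad is not None]
    if not grads:
        return torch.tensor(0.0)
    per_param = torch.stack([torch.linalg.vector_norm(g, ord) for g in grads])
    # for inf the global norm is the max of per-parameter norms,
    # otherwise it is the ord-norm of the per-parameter norms
    return per_param.max() if ord == float("inf") else torch.linalg.vector_norm(per_param, ord)

model = torch.nn.Linear(4, 4)
model(torch.randn(2, 4)).sum().backward()
print(global_grad_norm(model.parameters(), ord=float("inf")))
```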
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102429
Approved by: https://github.com/lezcano
Pass size argument.
<details>
<summary>ASAN report</summary>
```
==1640574==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x609000022160 at pc 0x03ff31a04b42 bp 0x03ff69885dc0 sp 0x03ff69885db0
READ of size 16 at 0x609000022160 thread T1
#0 0x3ff31a04b41 in at::vec::ZVECTOR::Vectorized<unsigned char, void>::loadu(void const*, int) /home/user/pytorch/aten/src/ATen/cpu/vec/vec256/zarch/vec256_zarch.h:397
#1 0x3ff31a04b41 in at::vec::ZVECTOR::Vectorized<c10::quint8, void>::loadu(void const*, int) /home/user/pytorch/aten/src/ATen/cpu/vec/vec256/zarch/vec256_zarch.h:1574
#2 0x3ff31a04b41 in operator() /home/user/pytorch/aten/src/ATen/native/quantized/cpu/kernels/QuantizedOpKernels.cpp:2668
#3 0x3ff31cefa5d in void at::internal::invoke_parallel<at::native::(anonymous namespace)::quantized_normalize_kernel(at::Tensor const&, at::Tensor const&, at::Tensor const&, bool, int, int, long, long
, double, at::Tensor*)::{lambda()#1}::operator()() const::{lambda()#2}::operator()() const::{lambda(long, long)#1}>(long, long, long, at::native::(anonymous namespace)::quantized_normalize_kernel(at::Tens
or const&, at::Tensor const&, at::Tensor const&, bool, int, int, long, long, double, at::Tensor*)::{lambda()#1}::operator()() const::{lambda()#2}::operator()() const::{lambda(long, long)#1} const&) [clone
._omp_fn.0] /home/user/pytorch/aten/src/ATen/ParallelOpenMP.h:42
#4 0x3ff6f31f52d in gomp_thread_start /var/tmp/portage/sys-devel/gcc-12.2.1_p20230304/work/gcc-12-20230304/libgomp/team.c:129
#5 0x3ff82218381 in start_thread /usr/src/debug/sys-libs/glibc-2.37-r1/glibc-2.37/nptl/pthread_create.c:444
#6 0x3ff822943f1 (/lib64/libc.so.6+0x1143f1)
0x609000022160 is located 0 bytes to the right of 32-byte region [0x609000022140,0x609000022160)
allocated by thread T0 here:
#0 0x3ff82a3663f in __interceptor_posix_memalign /usr/src/debug/sys-devel/gcc-11.3.1_p20230303/gcc-11-20230303/libsanitizer/asan/asan_malloc_linux.cpp:226
#1 0x3ff6f53ad95 in c10::alloc_cpu(unsigned long) /home/user/pytorch/c10/core/impl/alloc_cpu.cpp:74
Thread T1 created by T0 here:
#0 0x3ff829dc263 in __interceptor_pthread_create /usr/src/debug/sys-devel/gcc-11.3.1_p20230303/gcc-11-20230303/libsanitizer/asan/asan_interceptors.cpp:216
#1 0x3ff6f31fad5 in gomp_team_start /var/tmp/portage/sys-devel/gcc-12.2.1_p20230304/work/gcc-12-20230304/libgomp/team.c:858
SUMMARY: AddressSanitizer: heap-buffer-overflow /home/user/pytorch/aten/src/ATen/cpu/vec/vec256/zarch/vec256_zarch.h:397 in at::vec::ZVECTOR::Vectorized<unsigned char, void>::loadu(void const*, int)
Shadow bytes around the buggy address:
0x100c12000043d0: 00 fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x100c12000043e0: fd fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x100c12000043f0: fd fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x100c1200004400: fd fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x100c1200004410: fa fa fa fa fa fa fa fa fd fa fa fa fa fa fa fa
=>0x100c1200004420: fa fa fa fa fa fa fa fa 00 00 00 00[fa]fa fa fa
0x100c1200004430: fa fa fa fa fa fa fa fa fd fd fa fa fa fa fa fa
0x100c1200004440: fa fa fa fa fa fa fa fa fd fd fa fa fa fa fa fa
0x100c1200004450: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x100c1200004460: 00 00 fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x100c1200004470: 00 00 fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone: fa
Freed heap region: fd
Stack left redzone: f1
Stack mid redzone: f2
Stack right redzone: f3
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Container overflow: fc
Array cookie: ac
Intra object redzone: bb
ASan internal: fe
Left alloca redzone: ca
Right alloca redzone: cb
Shadow gap: cc
==1640574==ABORTING
```
</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101970
Approved by: https://github.com/Skylion007, https://github.com/jgong5
When inductor compiles the following example,
```python
def flip(x):
idx = torch.arange(x.shape[0] - 1, -1, -1, device=x.device)
return x[idx], idx
```
The return of `idx` forces it to be realized into a `ComputedBuffer`
and the downstream index call inserts a corresponding load and
indirect_indexing:
```python
tmp0 = tl.load(in_ptr0 + (x1), None)
tmp1 = triton_helpers.promote_to_tensor(tmp0)
tl.device_assert((0 <= tmp1) & (tmp1 < 128), "index out of bounds: 0 <= tmp1 < 128")
tmp2 = tl.load(in_ptr1 + (x0 + (128*tmp0)), None)
```
However, if we can inline the index expression from the buffer's
computation we instead get direct indexing (and half the loads):
```python
tmp0 = tl.load(in_ptr0 + (127 + ((-1)*x0)), None)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102000
Approved by: https://github.com/lezcano
Summary:
This enables use of CUDNN v8 in all Meta internal workflows. Also fixes the following minor issues:
- Skip LogCumSumExp compilation for complex dtypes for fbcode and RL
- Move `MakeConvOutputShape` template definition/specialization to anonymous namespace inside `at::native::quantized` as it is referenced from both `torch_cpu` and `torch_cuda`. This is necessary to avoid `duplicate symbol` linker error if say `libtorch_cpu` and `libtorch_cuda` are statically linked together.
- Lower the CuDNN v8 version guard from 8.3 to 8.2 (as there is no good reason why it should be 8.3; the first version of the library that properly supports all the features is actually 8.5)
Test Plan: CI
Differential Revision: D46161651
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102284
Approved by: https://github.com/atalman
This diff introduces the utility `find_sequential_partitions`.
This utility allows one to specify a sequential pattern of
nn.Module/nn.functional ops and returns a list. Each item in the list contains a
List[SourcePartition] that represents sequentially connected partitions
of the requested pattern.
For example `find_sequential_partitions(model, [nn.Conv2d, nn.ReLU])` will find
all nn.Conv2d and nn.ReLU partitions that are sequentially connected.
Furthermore, move to using `find_sequential_partitions` for conv_bn/conv_bn_relu
for QAT.
Differential Revision: [D45948057](https://our.internmc.facebook.com/intern/diff/D45948057/)
**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D45948057/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102394
Approved by: https://github.com/jerryzh168
Potentially fixes the second issue described in #87159.
In python_list.h, `int64_t` is used when `diff_type` is better suited. On 32 bit systems, int64_t isn't a proper signed size type, which may cause the compilation error described in #87159.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101922
Approved by: https://github.com/Skylion007
Summary:
Recently we changed the annotation from "target_dtype_info" to "quantization_annotation" and introduced the QuantizationAnnotation
and SharedQuantizationSpec APIs for users to convey sharing between inputs/outputs. This PR updates the _propagate_annotation
pass to accommodate the recent changes.
Test Plan:
```
buck2 test mode/opt caffe2/test:quantization_pt2e -- 'caffe2/test:quantization_pt2e'
```
Reviewed By: kimishpatel
Differential Revision: D46153084
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102422
Approved by: https://github.com/kimishpatel
The console log blows up too much when running in rerun disabled tests mode (x50) e132f09e88. Each log is around 1GB and the whole set of uncompressed logs is ~50GB. After compression it will be around 1GB, which is still too big. The increase comes mainly from the multiple SKIPPED messages for non-disabled tests, which is expected due to how SkipTest and pytest-flakyfinder currently work.
I updated `test/conftest.py` to completely ignore skipped tests when rerunning disabled tests instead of collecting and then skipping each of them 50 times. The benefit of doing this is much bigger than I originally expected:
* Rerun disabled tests jobs now finish in less than half an hour, as they should
* Fix the OOM runner crash caused by too many collected tests
* Fix the verbosity issue, as now only disabled tests are run x50 times. There are only a few hundred of them atm
* Fix the timeout issue when rerunning disabled distributed and ASAN tests. They are just too slow when running at x50
### Testing
When rerunning disabled tests https://github.com/pytorch/pytorch/actions/runs/5084508614, only disabled tests on the platform are run, for example `test_ops_jit` on https://ossci-raw-job-status.s3.amazonaws.com/log/13770164954 only ran 100 tests (`test_variant_consistency_jit_linalg_lu_cuda_float32` + `test_variant_consistency_jit_linalg_lu_factor_cuda_complex64`) x50.
```
Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_ops_jit.py', '--shard-id=1', '--num-shards=2', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--sc=test_ops_jit_1', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2023-05-25 21:32:49.763856]
Expand the folded group to see the log file of test_ops_jit 2/2
##[group]PRINTING LOG FILE of test_ops_jit 2/2 (/var/lib/jenkins/workspace/test/test-reports/test_ops_jit_h2wr_t2c.log)
Test results will be stored in test-reports/python-pytest/test_ops_jit/test_ops_jit-51a83bd44549074e.xml
============================= test session starts ==============================
platform linux -- Python 3.10.11, pytest-7.3.1, pluggy-1.0.0 -- /opt/conda/envs/py_3.10/bin/python
cachedir: .pytest_cache
hypothesis profile 'pytorch_ci' -> database=None, max_examples=50, derandomize=True, suppress_health_check=[HealthCheck.too_slow]
rootdir: /var/lib/jenkins/workspace
configfile: pytest.ini
plugins: hypothesis-5.35.1, cpp-2.3.0, flakefinder-1.1.0, rerunfailures-11.1.2, shard-0.1.2, xdist-3.3.0, xdoctest-1.1.0
collecting ... collected 1084 items
Running 100 items in this shard: test/test_ops_jit.py::TestJitCUDA::test_variant_consistency_jit_linalg_lu_cuda_float32 (x50), test/test_ops_jit.py::TestJitCUDA::test_variant_consistency_jit_linalg_lu_factor_cuda_complex64 (x50)
stepcurrent: Cannot find last run test, not skipping
test_ops_jit.py::TestJitCUDA::test_variant_consistency_jit_linalg_lu_cuda_float32 PASSED [2.1876s] [ 1%]
test_ops_jit.py::TestJitCUDA::test_variant_consistency_jit_linalg_lu_factor_cuda_complex64 PASSED [4.5615s] [ 2%]
```
* [pull](https://github.com/pytorch/pytorch/actions/runs/5093566864)
* [trunk](https://github.com/pytorch/pytorch/actions/runs/5095364311)
* [periodic](https://github.com/pytorch/pytorch/actions/runs/5095378850)
* [slow](https://github.com/pytorch/pytorch/actions/runs/5095390285)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102107
Approved by: https://github.com/clee2000, https://github.com/malfet
LeakSanitizer picks up this allocation as a leak, so turn the buffer and size into a single object that deallocates when the thread_local is destroyed.
Note that in our use case the call that hits this code runs on separate thread(s) which can, under the right circumstances, be torn down and rebuilt, hence leaking multiple instances of this allocation.
Testing was performed locally on an Apple M2 with this patch applied and the ~100MB of leaks previously shown by LeakSanitizer and Instruments are no longer there.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102276
Approved by: https://github.com/ezyang
Fixed test_memory_profiler::TestMemoryProfilerE2E::test_memory_timeline by changing the (arbitrary) threshold for logging. We observe differently-sized allocations on different AMD GPUs, so we chose a threshold higher than 512 to account for those differences and still satisfy the test requirements.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102243
Approved by: https://github.com/ezyang
Changes the StreamID encoding to use the last bit to distinguish between external and internal streams, 4 bits for IdType (DEFAULT, EXT, or user-created streams possibly with high priority), and 5 bits for the index. This allows us to expose more stream priorities to users (I'm currently setting 4, but that's easy to change now). Note that we are pre-creating all 32 streams in the pool for each allowed priority; I don't know if that's a problem in practice. Currently CUDA 11.8/A100 GPUs allow 6 different stream priorities; the number may differ across cards/CUDA versions.
Previous callsites explicitly requesting a high priority stream (`isHighPriority=true`) now get the highest priority stream.
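For context, a hedged user-facing sketch (not code from this PR) of how stream priorities are requested; the exact number of distinct priorities depends on the device and driver:
```python
import torch

if torch.cuda.is_available():
    # more negative = higher priority; out-of-range values are clamped by the driver
    high = torch.cuda.Stream(priority=-1)
    default = torch.cuda.Stream(priority=0)
    with torch.cuda.stream(high):
        a = torch.randn(1024, 1024, device="cuda") @ torch.randn(1024, 1024, device="cuda")
    with torch.cuda.stream(default):
        b = torch.randn(1024, 1024, device="cuda").sum()
    torch.cuda.synchronize()
```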
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101956
Approved by: https://github.com/ezyang
Summary:
```
"""
4. DerivedQuantizationSpec
this is the quantization spec for the Tensors whose quantization parameters are derived from other Tensors
"""
class DerivedQuantizationSpec(QuantizationSpecBase):
# specifies which Tensors the quantization parameters are derived from
# this can either be an edge from argument to node, or a node
derived_from: List[EdgeOrNode]
derive_qparams_fn: Callable[[List[ObserverOrFakeQuantize]], Tuple[Tensor, Tensor]]
...
```
Test Plan:
```
buck2 test mode/opt caffe2/test:quantization_pt2e -- 'caffe2/test:quantization_pt2e'
buck2 test mode/opt caffe2/test:quantization_pt2e -- --exact 'caffe2/test:quantization_pt2e - test_resnet18_with_quantizer_api (quantization.pt2e.test_quantize_pt2e.TestQuantizePT2EModels)'
```
Reviewed By: kimishpatel
Differential Revision: D46097855
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102282
Approved by: https://github.com/andrewor14
Key change - seed, offset are the last 2 args in both the fwd and bwd graphs
Reason - The cudagraphs implementation in inductor currently relies on very simple ordering guarantees i.e. first n inputs are static for both fwd and bwd graphs. In the current implementation of functionalization of rng ops, this assumption is broken because the first 2 inputs are seed, offset.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102344
Approved by: https://github.com/eellison
Previously we had runtime asserts for range constraints. This diff adds runtime asserts for equality constraints.
This requires a bit of refactoring that is worth calling out.
1. [Minor] Some of the data structures produced by export and consumed by the runtime assertion pass need to be broadened. This is a WIP. There are some associated code improvements that are included in this diff, but by and large the structures are similar to what exists now. Meanwhile @angelayi and I are chatting about how to make it qualitatively better: briefly, we want to index everything by symbols, which are 1-1 with (name, dim) pairs.
2. [Major] The order in which runtime asserts are emitted is changed. Previously we used to do the work in `placeholder`, now this diff adds a hook for "post-processing" after processing of all placeholders is done. This is needed because equality constraints can mention different placeholders. This change also opens the way to optimizing codegen: e.g., each (name, dim) pair should correspond to a single intermediate variable that is reused across runtime asserts. This is future work.
Differential Revision: [D46177642](https://our.internmc.facebook.com/intern/diff/D46177642/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102256
Approved by: https://github.com/tugsbayasgalan, https://github.com/angelayi
Update `CONTRIBUTING.md` with tip on how to avoid rebuilding/copying libs every time one makes a small change to the native code.
<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at f5e8394</samp>
> _`setup.py` docs_
> _Link to source and build dirs_
> _Winter of testing_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102316
Approved by: https://github.com/kit1980, https://github.com/huydhn
Enables MTPG for some FSDP tests in this file. Tests that need the
backward pass and warning logging are left as follow up work.
Backward pass issue: It seems that there is a hang with all_gather. Will sync with @kumpera on this.
Warning issue: We have a couple tests that regex check on warnings, but in the
multithreaded scenario these warnings are somehow not logged.
Differential Revision: [D43209769](https://our.internmc.facebook.com/intern/diff/D43209769/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102043
Approved by: https://github.com/awgu
# Summary
This is another upstream which is much smaller than the previous one.
This bumps the kernel versions from xformers
Current: [6425fd0cacb1a6579aa2f0c4a570b737cb10e9c3](6425fd0cac)
With this PR: [1d635e193e169fc677b2e7fa42dad7ebe88eec9e](1d635e193e)
### Notable Changes:
- Drastically improve the BW pass in multiple cases (especially when B*numHeads < 100)
- H100 Support: *Warning* While these kernels have been added, we don't have the CI/CD machines to test.
- Enables a deterministic mode.
## Specific Changes
- Updates to the backward kernel.
- Added num_splits_key which we hard code to -1. (This is another performance knob that we set to the heuristic)
- Update gen_code and kernels to produce h100 instantiations.
### Due Diligence Checks:
* CUDA_lib size: No changes in size
#### Performance
* Micro Benchmark: (batch_size: 1, num_heads=25, seq_len=4096, embed_dim = 64 | grid:[1,25,1]block: [128,1,1])
* MemEfficientAttention Backward Kernel: 27.972 ms
* After the updated Xformers code(https://github.com/pytorch/pytorch/pull/100583): 23.958 ms
* With this PR: 4.085 ms
* Ran micro benchmarks on sdpa_forw().sum().backward() over a range of dtypes, and input shapes
* Geo_mean increase -> 1.17x
* Max increase -> 2.95x
* min_increase -> 0.8x
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101847
Approved by: https://github.com/cpuhrsch
Currently, if we have an inplaced buffer that's completely internal to a fused kernel and thus doesn't need to be allocated, we are still allocating it and sending an unused argument to the kernel, because our analysis for removing buffers treats it separately (assuming that either the original or the mutated value is still needed).
This PR extends buffer removal to inplaced buffers that can be removed.
Generated kernel for e.g. ln changes from
```
def triton_(in_out_ptr0, in_out_ptr1, in_ptr0, in_ptr1, in_ptr2, out_ptr0, out_ptr1, xnumel, rnumel, XBLOCK : tl.constexpr):
```
where in_out_ptr0 is unused in the kernel to
```
def triton_(in_out_ptr1, in_ptr0, in_ptr1, in_ptr2, out_ptr0, out_ptr1, xnumel, rnumel, XBLOCK : tl.constexpr):
```
and corresponding allocation/reuse lines in the wrapper are removed.
The `in_out_ptr1` is also mislabeled - it's not `in_out`, it's only written to, but this PR doesn't fix it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102289
Approved by: https://github.com/jansel
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 8f8d620</samp>
This pull request improves the testing of the `nn.functional.multi_head_attention_forward` function by adding it to the `OpInfo` framework, adjusting the tolerance and skipping criteria for some test cases, and restricting the dtype for the `MetaProgrammingSystem` tests. These changes aim to address the randomness and numerical precision issues of the function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100153
Approved by: https://github.com/drisspg
Summary: The new logger allows passing metadata into the api usage logger. The immediate use case is to pass the serialization_id to the save and load events to enable tracking serialized models in API events. It could be extended to add more metadata in the future.
Test Plan:
```
buck2 test @//mode/dev //caffe2/caffe2/serialize:inline_container_test
```
Reviewed By: davidberard98
Differential Revision: D45683697
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101762
Approved by: https://github.com/davidberard98
We don't need to manage their memory since they don't have any. Previously you would get the error `RuntimeError: These storage data ptrs are not allocated in pool (0, 2) but should be {0}`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102273
Approved by: https://github.com/ngimel
Bumps [requests](https://github.com/psf/requests) from 2.26 to 2.31.0.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a href="https://github.com/psf/requests/releases">requests's releases</a>.</em></p>
<blockquote>
<h2>v2.31.0</h2>
<h2>2.31.0 (2023-05-22)</h2>
<p><strong>Security</strong></p>
<ul>
<li>
<p>Versions of Requests between v2.3.0 and v2.30.0 are vulnerable to potential
forwarding of <code>Proxy-Authorization</code> headers to destination servers when
following HTTPS redirects.</p>
<p>When proxies are defined with user info (<a href="https://user:pass@proxy:8080">https://user:pass@proxy:8080</a>), Requests
will construct a <code>Proxy-Authorization</code> header that is attached to the request to
authenticate with the proxy.</p>
<p>In cases where Requests receives a redirect response, it previously reattached
the <code>Proxy-Authorization</code> header incorrectly, resulting in the value being
sent through the tunneled connection to the destination server. Users who rely on
defining their proxy credentials in the URL are <em>strongly</em> encouraged to upgrade
to Requests 2.31.0+ to prevent unintentional leakage and rotate their proxy
credentials once the change has been fully deployed.</p>
<p>Users who do not use a proxy or do not supply their proxy credentials through
the user information portion of their proxy URL are not subject to this
vulnerability.</p>
<p>Full details can be read in our <a href="https://github.com/psf/requests/security/advisories/GHSA-j8r2-6x86-q33q">Github Security Advisory</a>
and <a href="https://nvd.nist.gov/vuln/detail/CVE-2023-32681">CVE-2023-32681</a>.</p>
</li>
</ul>
<h2>v2.30.0</h2>
<h2>2.30.0 (2023-05-03)</h2>
<p><strong>Dependencies</strong></p>
<ul>
<li>
<p>⚠️ Added support for urllib3 2.0. ⚠️</p>
<p>This may contain minor breaking changes so we advise careful testing and
reviewing <a href="https://urllib3.readthedocs.io/en/latest/v2-migration-guide.html">https://urllib3.readthedocs.io/en/latest/v2-migration-guide.html</a>
prior to upgrading.</p>
<p>Users who wish to stay on urllib3 1.x can pin to <code>urllib3<2</code>.</p>
</li>
</ul>
<h2>v2.29.0</h2>
<h2>2.29.0 (2023-04-26)</h2>
<p><strong>Improvements</strong></p>
<ul>
<li>Requests now defers chunked requests to the urllib3 implementation to improve
standardization. (<a href="https://redirect.github.com/psf/requests/issues/6226">#6226</a>)</li>
<li>Requests relaxes header component requirements to support bytes/str subclasses. (<a href="https://redirect.github.com/psf/requests/issues/6356">#6356</a>)</li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a href="147c8511dd"><code>147c851</code></a> v2.31.0</li>
<li><a href="74ea7cf7a6"><code>74ea7cf</code></a> Merge pull request from GHSA-j8r2-6x86-q33q</li>
<li><a href="3022253346"><code>3022253</code></a> test on pypy 3.8 and pypy 3.9 on windows and macos (<a href="https://redirect.github.com/psf/requests/issues/6424">#6424</a>)</li>
<li><a href="b639e66c81"><code>b639e66</code></a> test on py3.12 (<a href="https://redirect.github.com/psf/requests/issues/6448">#6448</a>)</li>
<li><a href="d3d504436e"><code>d3d5044</code></a> Fixed a small typo (<a href="https://redirect.github.com/psf/requests/issues/6452">#6452</a>)</li>
<li><a href="2ad18e0e10"><code>2ad18e0</code></a> v2.30.0</li>
<li><a href="f2629e9e3c"><code>f2629e9</code></a> Remove strict parameter (<a href="https://redirect.github.com/psf/requests/issues/6434">#6434</a>)</li>
<li><a href="87d63de873"><code>87d63de</code></a> v2.29.0</li>
<li><a href="51716c4ef3"><code>51716c4</code></a> enable the warnings plugin (<a href="https://redirect.github.com/psf/requests/issues/6416">#6416</a>)</li>
<li><a href="a7da1ab349"><code>a7da1ab</code></a> try on ubuntu 22.04 (<a href="https://redirect.github.com/psf/requests/issues/6418">#6418</a>)</li>
<li>Additional commits viewable in <a href="https://github.com/psf/requests/compare/v2.26.0...v2.31.0">compare view</a></li>
</ul>
</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102057
Approved by: https://github.com/huydhn
NCCL 2.17+ introduces some user configurable parameters for NCCL communicators using [ncclConfig_t](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/types.html#c.ncclConfig_t) datatype and [ncclCommInitRankConfig](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/comms.html#ncclcomminitrankconfig). This PR enables that feature.
A user can tune the parameters as follows:
```
import torch.distributed as dist
nccl_options = dist.ProcessGroupNCCL.Options()
nccl_options.config.max_ctas = 32
nccl_options.config.min_ctas = 8
nccl_options.config.cga_cluster_size = 2
dist.init_process_group(backend='nccl', init_method='env://', pg_options=nccl_options)
my_group = dist.new_group(pg_options=nccl_options)
```
The default values of these parameters are those initialized by `NCCL_CONFIG_INITIALIZER`. For DistributedDataParallel only, this PR sets the default value of cga_cluster_size to 2 (a heuristic that works especially well for DDP workloads).
Tuning these parameters can lead to improvement in end-to-end performance, since it affects the communication-computation overlap for NCCL kernels.
CC: @ptrblck @kwen2501
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97394
Approved by: https://github.com/kwen2501
Summary: Forward fix for t53725825. The new map implementation breaks multiple internal tests; this forward-fixes some of them. To unblock the others, the unfixed ones are marked as expectedFailure first.
Test Plan: Test with CI.
Reviewed By: angelayi
Differential Revision: D46084287
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102009
Approved by: https://github.com/angelayi
Summary:
This PR adds support for SharedQuantizationSpec. It is used to express sharing between
two Tensors in the prepared graph; the Tensor will either be an input of some node (expressed as a Tuple of fx nodes) or
the output of some node (expressed as an fx Node).
Test Plan:
```
buck2 test mode/opt caffe2/test:quantization_pt2e -- 'caffe2/test:quantization_pt2e'
buck2 test mode/opt caffe2/test:quantization_pt2e -- --exact 'caffe2/test:quantization_pt2e - test_resnet18_with_quantizer_api (quantization.pt2e.test_quantize_pt2e.TestQuantizePT2EModels)'
```
Differential Revision: D46043026
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102184
Approved by: https://github.com/kimishpatel, https://github.com/leslie-fang-intel
There are a few reasons for this:
1. When I tried to enable padding via decompositions, I ran into weird errors with a number of models, I believe because we were making the type of a regular tensor a fake tensor.
2. This gives us flexibility to go before or after other graph passes
3. We can now also reason about the cost of the padding, and whether or not it can be fused since we have access to the graph
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101913
Approved by: https://github.com/ngimel
This PR does the following things:
- Align the C++ behavior with Python for FloorDiv.
- Always return the expr dtype for some ops which do not use the expr's dtype to do the computation.
After this PR, the TIMM ```levit_128``` and ```volo_d1_224``` accuracy tests pass for the dynamic shape path.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102068
Approved by: https://github.com/jgong5, https://github.com/ngimel
Summary:
# Context
In TorchRec's train pipeline, we need to fx trace a module to analyze the arguments on the forward call. In order to do this, we need to preserve some sort of meaning with each argument (a key or name of sorts that lets us identify the argument).
The issue is that when you use concrete args, internally, fx will unflatten the arg into its constituents (to locate PHs).
Given a function that looks like this:
```
def process(batch: Dict[str, torch.Tensor]):
....
symbolic_trace(process, concrete_args: {"batch": {"f1": PH, "f2": PH}})
# function will be rewritten to look like:
def process(batch_1, batch_2): # batch_1 -> "f1", batch_2->"f2"
...
```
When you traverse through the nodes of the graph, the names of the argument nodes to the function are batch_1 and batch_2. **This doesn't mean anything to the user who is fx tracing.** There isn't anything indicating that batch_1 corresponds to key "f1" in the batch input.
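A small runnable sketch of the problem described above (the placeholder names shown are examples of what symbolic tracing generates):
```python
import torch
from torch.fx import symbolic_trace, PH

def process(batch):
    return batch["f1"] + batch["f2"]

# The dict-valued concrete arg is flattened into separate placeholders.
gm = symbolic_trace(process, concrete_args={"batch": {"f1": PH, "f2": PH}})
placeholder_names = [n.name for n in gm.graph.nodes if n.op == "placeholder"]
print(placeholder_names)  # e.g. ['batch_1', 'batch_2'] -- nothing ties these back to 'f1'/'f2'
```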
# Solution
When fx sees a "PH", it creates a proxy node.
The user does not have direct access to proxy creation, but only through the PH structure.
Attach a piece of metadata, `ph_key`, to the PH when you set it in the concrete args; it will get passed into proxy + node creation. So when you traverse the graph, this metadata sticks to the node as an attribute. This way you have a way of tagging "batch_1" as "f1".
Test Plan: added a unit test
Reviewed By: dstaay-fb
Differential Revision: D44947653
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102195
Approved by: https://github.com/PaliC
Summary: Implement `torch.cat(tensors, dim=0)`, which concatenates a given sequence of tensors in the given dimension, for Vulkan backend. See the behavior of the operator here: https://pytorch.org/docs/stable/generated/torch.cat.html
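A hedged usage sketch of the operator being added (requires a Vulkan-enabled PyTorch build; shapes are illustrative):
```python
import torch

a = torch.randn(2, 3, 4, 4)
b = torch.randn(5, 3, 4, 4)
# move to the Vulkan backend, concatenate along dim 0, then copy back to CPU
out = torch.cat([a.vulkan(), b.vulkan()], dim=0).cpu()
assert out.shape == (7, 3, 4, 4)
```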
Test Plan:
```
(base) luwei@luwei-mbp fbsource % buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 -- --gtest_filter="*cat_*"
Downloaded 0/2 artifacts, 0.00 bytes, 100.0% cache miss (for updated rules)
Building: finished in 12.2 sec (100%) 471/471 jobs, 2/471 updated
Total time: 12.2 sec
BUILD SUCCEEDED
Running main() from xplat/third-party/gmock/googletest-1.12.1/googletest/src/gtest_main.cc
Note: Google Test filter = *cat_*
[==========] Running 40 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 40 tests from VulkanAPITest
[ RUN ] VulkanAPITest.cat_4d_dim0_invalidinputs_exceptions
[ OK ] VulkanAPITest.cat_4d_dim0_invalidinputs_exceptions (73 ms)
[ RUN ] VulkanAPITest.cat_4d_dim0_samebatch_success
[ OK ] VulkanAPITest.cat_4d_dim0_samebatch_success (36 ms)
[ RUN ] VulkanAPITest.cat_4d_dim0_diffbatch_success
[ OK ] VulkanAPITest.cat_4d_dim0_diffbatch_success (20 ms)
[ RUN ] VulkanAPITest.cat_4d_dim0_singledepth_success
[ OK ] VulkanAPITest.cat_4d_dim0_singledepth_success (2 ms)
[ RUN ] VulkanAPITest.cat_4d_dim0_singletensor_success
[ OK ] VulkanAPITest.cat_4d_dim0_singletensor_success (4 ms)
[ RUN ] VulkanAPITest.cat_4d_dim0_twotensors_success
[ OK ] VulkanAPITest.cat_4d_dim0_twotensors_success (13 ms)
[ RUN ] VulkanAPITest.cat_4d_dim0_negdim_success
[ OK ] VulkanAPITest.cat_4d_dim0_negdim_success (38 ms)
[ RUN ] VulkanAPITest.cat_4d_dim1_negdim_success
[ OK ] VulkanAPITest.cat_4d_dim1_negdim_success (26 ms)
[ RUN ] VulkanAPITest.cat_4d_dim2_negdim_success
[ OK ] VulkanAPITest.cat_4d_dim2_negdim_success (31 ms)
[ RUN ] VulkanAPITest.cat_4d_dim3_negdim_success
[ OK ] VulkanAPITest.cat_4d_dim3_negdim_success (30 ms)
[ RUN ] VulkanAPITest.cat_4d_dim1_singledepth_success
[ OK ] VulkanAPITest.cat_4d_dim1_singledepth_success (2 ms)
[ RUN ] VulkanAPITest.cat_4d_dim1_singletensor_success
[ OK ] VulkanAPITest.cat_4d_dim1_singletensor_success (4 ms)
[ DISABLED ] VulkanAPITest.DISABLED_cat_4d_dim1_twotensors_success
[ RUN ] VulkanAPITest.cat_4d_dim1_bat1_mult4ch_success
[ OK ] VulkanAPITest.cat_4d_dim1_bat1_mult4ch_success (4 ms)
[ RUN ] VulkanAPITest.cat_4d_dim1_bat2_mult4ch_success
[ OK ] VulkanAPITest.cat_4d_dim1_bat2_mult4ch_success (7 ms)
[ RUN ] VulkanAPITest.cat_4d_dim1_mult4ch_mixed_success
[ OK ] VulkanAPITest.cat_4d_dim1_mult4ch_mixed_success (19 ms)
[ DISABLED ] VulkanAPITest.DISABLED_cat_4d_dim1_mult4ch_nonmult4ch_success
[ RUN ] VulkanAPITest.cat_4d_dim2_sameheight_success
[ OK ] VulkanAPITest.cat_4d_dim2_sameheight_success (23 ms)
[ RUN ] VulkanAPITest.cat_4d_dim2_diffheight_success
[ OK ] VulkanAPITest.cat_4d_dim2_diffheight_success (23 ms)
[ RUN ] VulkanAPITest.cat_4d_dim2_singledepth_success
[ OK ] VulkanAPITest.cat_4d_dim2_singledepth_success (2 ms)
[ RUN ] VulkanAPITest.cat_4d_dim2_invalidinputs_exceptions
[ OK ] VulkanAPITest.cat_4d_dim2_invalidinputs_exceptions (23 ms)
[ RUN ] VulkanAPITest.cat_4d_dim3_invalidinputs_exceptions
[ OK ] VulkanAPITest.cat_4d_dim3_invalidinputs_exceptions (23 ms)
[ RUN ] VulkanAPITest.cat_4d_dim3_samewidth_success
[ OK ] VulkanAPITest.cat_4d_dim3_samewidth_success (30 ms)
[ RUN ] VulkanAPITest.cat_4d_dim3_diffwidth_success
[ OK ] VulkanAPITest.cat_4d_dim3_diffwidth_success (22 ms)
[ RUN ] VulkanAPITest.cat_3d_dim0_diff_channel_success
[ OK ] VulkanAPITest.cat_3d_dim0_diff_channel_success (8 ms)
[ RUN ] VulkanAPITest.cat_3d_dim0_same_channel_success
[ OK ] VulkanAPITest.cat_3d_dim0_same_channel_success (5 ms)
[ RUN ] VulkanAPITest.cat_3d_dim1_diffheight_success
[ OK ] VulkanAPITest.cat_3d_dim1_diffheight_success (7 ms)
[ RUN ] VulkanAPITest.cat_3d_dim1_same_height_success
[ OK ] VulkanAPITest.cat_3d_dim1_same_height_success (6 ms)
[ RUN ] VulkanAPITest.cat_3d_dim2_diffwidth_success
[ OK ] VulkanAPITest.cat_3d_dim2_diffwidth_success (9 ms)
[ RUN ] VulkanAPITest.cat_3d_dim2_samewidth_success
[ OK ] VulkanAPITest.cat_3d_dim2_samewidth_success (4 ms)
[ RUN ] VulkanAPITest.cat_3d_dim0_negdim_success
[ OK ] VulkanAPITest.cat_3d_dim0_negdim_success (8 ms)
[ RUN ] VulkanAPITest.cat_3d_dim1_negdim_success
[ OK ] VulkanAPITest.cat_3d_dim1_negdim_success (8 ms)
[ RUN ] VulkanAPITest.cat_3d_dim2_negdim_success
[ OK ] VulkanAPITest.cat_3d_dim2_negdim_success (5 ms)
[ RUN ] VulkanAPITest.cat_2d_dim0_same_height_success
[ OK ] VulkanAPITest.cat_2d_dim0_same_height_success (2 ms)
[ RUN ] VulkanAPITest.cat_2d_dim0_diff_height_success
[ OK ] VulkanAPITest.cat_2d_dim0_diff_height_success (1 ms)
[ RUN ] VulkanAPITest.cat_2d_dim1_same_width_success
[ OK ] VulkanAPITest.cat_2d_dim1_same_width_success (1 ms)
[ RUN ] VulkanAPITest.cat_2d_dim1_diff_width_success
[ OK ] VulkanAPITest.cat_2d_dim1_diff_width_success (1 ms)
[ RUN ] VulkanAPITest.cat_2d_dim0_negdim_success
[ OK ] VulkanAPITest.cat_2d_dim0_negdim_success (1 ms)
[ RUN ] VulkanAPITest.cat_2d_dim1_negdim_success
[ OK ] VulkanAPITest.cat_2d_dim1_negdim_success (2 ms)
[ RUN ] VulkanAPITest.cat_1d_dim0_same_width_success
[ OK ] VulkanAPITest.cat_1d_dim0_same_width_success (0 ms)
[ RUN ] VulkanAPITest.cat_1d_dim0_diff_width_success
[ OK ] VulkanAPITest.cat_1d_dim0_diff_width_success (0 ms)
[ RUN ] VulkanAPITest.cat_1d_dim0_negdim_success
[ OK ] VulkanAPITest.cat_1d_dim0_negdim_success (0 ms)
[----------] 40 tests from VulkanAPITest (543 ms total)
[----------] Global test environment tear-down
[==========] 40 tests from 1 test suite ran. (543 ms total)
[ PASSED ] 40 tests.
YOU HAVE 2 DISABLED TESTS
```
Reviewed By: SS-JIA
Differential Revision: D46059444
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102128
Approved by: https://github.com/SS-JIA
The main use case here is that folks would like to ignore layer norm for mixed precision. This can now be enabled with:
```
mp_config = MixedPrecision(
param_dtype=torch.float16,
reduce_dtype=torch.float16,
buffer_dtype=torch.float16,
_mixed_precision_module_classes_to_ignore=[_BatchNorm, nn.LayerNorm],
)
```
This is done by wrapping modules whose types are in `_mixed_precision_module_classes_to_ignore` in their own FSDP unit with mixed precision disabled. This is only enabled for auto wrapping.
We also add module pre and post hooks to cast / downcast inputs to the appropriate full precision.
Differential Revision: [D46079957](https://our.internmc.facebook.com/intern/diff/D46079957/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102010
Approved by: https://github.com/awgu
[BE] `require_backend_is_available` offers a more thorough check than `require_backend`, but both are often used together. This removes `require_backend` and centralizes on the `require_backend_is_available` decorator
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101891
Approved by: https://github.com/awgu
## Issue description
The PR https://github.com/pytorch/pytorch/pull/100064 introduces a new RNG operation process. However, it causes every `randint` to load a separate random seed by default. TorchInductor generates a buffer to store all necessary random seeds and places the offsets as constant values in the subsequent compute buffers. In the ir_pre_fusion generated by TorchInductor, some buffers only differ by one line, which is the load of the random seed at the corresponding offset. Subsequently, the codegen generates Triton kernels following the same rule. Finally, in output_code.py, some Triton kernels only differ by one line, meaning that redundant kernels are being generated.
## Solution
This PR captures the seed offset and adds it to the existing `self.sizevars` structure. It generates variable names as placeholders, allowing the code wrapper to pass the offset as an argument to the kernels. I've also modified the divisible_by_16 check to exclude this argument.
This PR reduces the number of generated kernels from 50 to 17 for BertForMaskedLM forward.
According to tests on my own environment, the compilation time of attention_is_all_you_need_pytorch has been reduced from 94s to 66s. The speedup remains largely unchanged, at 1.37X.
The following is a comparison for a simple example.
Before:
```
triton_poi_fused_0 = async_compile.triton('triton_', '''
...
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
...
tmp0 = tl.load(in_ptr0 + 0)
tmp1 = x0
tmp2 = triton_helpers.randint64(tmp0, (tmp1).to(tl.uint32), 0, 10)
triton_poi_fused_1 = async_compile.triton('triton_', '''
...
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
...
tmp0 = tl.load(in_ptr0 + 1)
tmp1 = x0
tmp2 = triton_helpers.randint64(tmp0, (tmp1).to(tl.uint32), 0, 10)
...''')
def call(args):
triton_poi_fused_0.run(buf0, buf1, 1024, grid=grid(1024), stream=stream0)
triton_poi_fused_1.run(buf0, buf2, 1024, grid=grid(1024), stream=stream0)
```
After:
```
triton_poi_fused_0 = async_compile.triton('triton_', '''
...
def triton_(in_ptr0, out_ptr0, load_seed_offset, xnumel, XBLOCK : tl.constexpr):
...
tmp0 = tl.load(in_ptr0 + load_seed_offset)
tmp1 = x0
tmp2 = triton_helpers.randint64(tmp0, (tmp1).to(tl.uint32), 0, 10)
....
def call(args):
triton_poi_fused_0.run(buf0, buf1, 0, 1024, grid=grid(1024), stream=stream0)
triton_poi_fused_0.run(buf0, buf2, 1, 1024, grid=grid(1024), stream=stream0)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102104
Approved by: https://github.com/jansel, https://github.com/ngimel
Summary:
Use the existing permute shader to implement the following two operators for Vulkan backend
- `aten::transpose` The behavior of the operator is shown in https://pytorch.org/docs/stable/generated/torch.transpose.html.
- `aten::t` The behavior of the operator is shown in https://pytorch.org/docs/stable/generated/torch.t.html#torch.t. 1d tensors are returned as is. When input is a 2d tensor this is equivalent to `aten::transpose(input, 0, 1)`.
Test Plan:
At local repo of fbsource on MacBook, run `buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1`
- Full test results P739033174.
- `aten::t` and `aten::tranpose` related results shown below
```
(base) luwei@luwei-mbp fbsource % buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1
[... other tests ...]
[ RUN ] VulkanAPITest.transpose_t_1d
[ OK ] VulkanAPITest.transpose_t_1d (0 ms)
[ RUN ] VulkanAPITest.transpose_t_2d_small
[ OK ] VulkanAPITest.transpose_t_2d_small (1 ms)
[ RUN ] VulkanAPITest.transpose_t_2d_medium
[ OK ] VulkanAPITest.transpose_t_2d_medium (0 ms)
[ RUN ] VulkanAPITest.transpose_t_2d_large
[ OK ] VulkanAPITest.transpose_t_2d_large (0 ms)
[ RUN ] VulkanAPITest.transpose_2d_height_and_width_small
[ OK ] VulkanAPITest.transpose_2d_height_and_width_small (0 ms)
[ RUN ] VulkanAPITest.transpose_2d_height_and_width_medium
[ OK ] VulkanAPITest.transpose_2d_height_and_width_medium (0 ms)
[ RUN ] VulkanAPITest.transpose_2d_height_and_width_large
[ OK ] VulkanAPITest.transpose_2d_height_and_width_large (0 ms)
[ RUN ] VulkanAPITest.transpose_2d_height_and_height_large
[ OK ] VulkanAPITest.transpose_2d_height_and_height_large (0 ms)
[ RUN ] VulkanAPITest.transpose_2d_width_and_width_large
[ OK ] VulkanAPITest.transpose_2d_width_and_width_large (0 ms)
[ RUN ] VulkanAPITest.transpose_3d_height_and_width_small
[ OK ] VulkanAPITest.transpose_3d_height_and_width_small (0 ms)
[ RUN ] VulkanAPITest.transpose_3d_height_and_width_medium
[ OK ] VulkanAPITest.transpose_3d_height_and_width_medium (1 ms)
[ RUN ] VulkanAPITest.transpose_3d_height_and_width_large
[ OK ] VulkanAPITest.transpose_3d_height_and_width_large (1 ms)
[ RUN ] VulkanAPITest.transpose_3d_width_and_width_large
[ OK ] VulkanAPITest.transpose_3d_width_and_width_large (0 ms)
[ RUN ] VulkanAPITest.transpose_3d_depth_and_width_small
[ OK ] VulkanAPITest.transpose_3d_depth_and_width_small (0 ms)
[ RUN ] VulkanAPITest.transpose_3d_depth_and_width_medium
[ OK ] VulkanAPITest.transpose_3d_depth_and_width_medium (0 ms)
[ RUN ] VulkanAPITest.transpose_3d_depth_and_width_large
[ OK ] VulkanAPITest.transpose_3d_depth_and_width_large (0 ms)
[ RUN ] VulkanAPITest.transpose_3d_depth_and_depth_large
[ OK ] VulkanAPITest.transpose_3d_depth_and_depth_large (0 ms)
[ RUN ] VulkanAPITest.transpose_3d_depth_and_height_small
[ OK ] VulkanAPITest.transpose_3d_depth_and_height_small (0 ms)
[ RUN ] VulkanAPITest.transpose_3d_depth_and_height_medium
[ OK ] VulkanAPITest.transpose_3d_depth_and_height_medium (0 ms)
[ RUN ] VulkanAPITest.transpose_3d_depth_and_height_large
[ OK ] VulkanAPITest.transpose_3d_depth_and_height_large (2 ms)
[ RUN ] VulkanAPITest.transpose_3d_height_and_height_large
[ OK ] VulkanAPITest.transpose_3d_height_and_height_large (1 ms)
[ RUN ] VulkanAPITest.transpose_4d_batch_and_batch_large
[ OK ] VulkanAPITest.transpose_4d_batch_and_batch_large (1 ms)
[ RUN ] VulkanAPITest.transpose_4d_depth_and_depth_large
[ OK ] VulkanAPITest.transpose_4d_depth_and_depth_large (1 ms)
[ RUN ] VulkanAPITest.transpose_4d_height_and_height_large
[ OK ] VulkanAPITest.transpose_4d_height_and_height_large (1 ms)
[ RUN ] VulkanAPITest.transpose_4d_width_and_width_large
[ OK ] VulkanAPITest.transpose_4d_width_and_width_large (0 ms)
[ RUN ] VulkanAPITest.transpose_4d_batch_and_depth_large
[ OK ] VulkanAPITest.transpose_4d_batch_and_depth_large (1 ms)
[ RUN ] VulkanAPITest.transpose_4d_batch_and_height_large
[ OK ] VulkanAPITest.transpose_4d_batch_and_height_large (2 ms)
[ RUN ] VulkanAPITest.transpose_4d_batch_and_width_large
[ OK ] VulkanAPITest.transpose_4d_batch_and_width_large (2 ms)
[ RUN ] VulkanAPITest.transpose_4d_depth_and_height_large
[ OK ] VulkanAPITest.transpose_4d_depth_and_height_large (2 ms)
[ RUN ] VulkanAPITest.transpose_4d_depth_and_width_large
[ OK ] VulkanAPITest.transpose_4d_depth_and_width_large (2 ms)
[ RUN ] VulkanAPITest.transpose_4d_height_and_width_large
[ OK ] VulkanAPITest.transpose_4d_height_and_width_large (1 ms)
[... other tests ...]
```
Reviewed By: SS-JIA
Differential Revision: D45878333
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101808
Approved by: https://github.com/SS-JIA
Summary: We don't need to leak matched input positions from dynamo anymore if we can just populate all args with corresponding fake tensors.
Test Plan: CI
Differential Revision: D46131556
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102129
Approved by: https://github.com/angelayi
Summary:
https://github.com/pytorch/pytorch/pull/98488 implements CSE for dynamo guards, and it relies on astunparse to perform the optimization.
`test_guards_cse_pass_single` was broken and later was fixed by introducing a check_and_skip_if_needed. This actually fixes the root cause on fbcode and should bring some perf gain internally.
Test Plan: `buck2 test @//mode/opt //caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_misc.py::DynamicShapesMiscTests::test_guards_cse_pass_single' --run-disabled`
Reviewed By: malfet
Differential Revision: D46126742
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102120
Approved by: https://github.com/malfet
Similar to https://github.com/pytorch/pytorch/pull/96160 but for the modules
nn.PixelShuffle and nn.PixelUnshuffle.
torch.nn.PixelUnshuffle accepts both float and quantized inputs.
However, previously we would unnecessarily dequantize quantized inputs into floats
before passing them to the function. This commit fixes this by lowering the patterns
[dequant - PixelShuffle - quant] and
[dequant - PixelUnshuffle - quant].
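A hedged sketch of the scenario this lowering targets (assuming the standard FX graph mode quantization APIs and an available fbgemm backend; the model is illustrative, not the PR's test):
```python
import torch
import torch.nn as nn
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import convert_fx, prepare_fx

class M(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 12, 1)
        self.ps = nn.PixelShuffle(2)  # sits between quantized ops

    def forward(self, x):
        return self.ps(self.conv(x))

example_inputs = (torch.randn(1, 3, 8, 8),)
prepared = prepare_fx(M().eval(), get_default_qconfig_mapping("fbgemm"), example_inputs)
prepared(*example_inputs)  # calibration
quantized = convert_fx(prepared)
# with this change the PixelShuffle should stay in the quantized domain
print(quantized.graph)
```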
Test Plan:
python test/test_quantization.py TestQuantizeFxOps.test_pixel_shuffle_module
python test/test_quantization.py TestQuantizeFxOps.test_pixel_unshuffle_module
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101926
Approved by: https://github.com/jerryzh168
Add 'ignored_states', which accepts either a list of ignored parameters or a list of nn modules, to the FSDP model wrapper and the fully_shard composable API. It is recommended to use 'ignored_states' over 'ignored_modules' moving forward
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102056
Approved by: https://github.com/awgu
This PR enables data parallel to work with a non-zero batch dim. The only
thing we need to do is expose the input_batch_dim to DataParallelMode,
and the data parallel expansion automatically works since we have done
things correctly in the batch dim analysis.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100073
Approved by: https://github.com/mrshenli
This PR improves the activation handling logic of data parallel to
support the cases where there are tensor factory ops that do not depend
on any input node; they still produce an activation, either a
sharded act (i.e. if the output shape has the batch size) or a replicated act.
It also significantly simplifies the full reduction logic: now we don't
need the full reduction detection, we only need to ensure that when
computing the batch dim, we detect full reductions and mark them as sharded.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100853
Approved by: https://github.com/mrshenli
This PR enhances the batch dim analysis of data parallel to better understand
the cases where the batch dim gets flattened or split. Using
dtensor's view ops, we are able to track a batch dim that gets
transformed in non-trivial ways.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100852
Approved by: https://github.com/mrshenli
Fixes #91648
As explained in the tracking issue, the incomplete type stubs in `torch/nn/parallel` mask `DataParallel` methods relevant for subclassing and also mask type issues present in the code as well.
One notable change here is the addition of [`allow_redefinition = True`](https://mypy.readthedocs.io/en/stable/config_file.html#confval-allow_redefinition) in `mypy.ini`, which allows for a common pattern:
> Allows variables to be redefined with an arbitrary type, as long as the redefinition is in the same block and nesting level as the original definition.
This is added specifically to allow for the type narrowing of `device_ids` in `torch.nn.parallel.data_parallel.data_parallel` from `Sequence[Union[int, torch.device]]` to `Sequence[int]`.
Other than this, there are various renamings and `type: ignore` comments added to bypass errors that arose from the merging.
@ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101528
Approved by: https://github.com/ezyang
As part of split-cat transforms, we needed to unsqueeze additional inputs (not coming from split) but going to the cat/stack nodes.
However, this leads to patterns like:
```
split -> unsqueeze -> cat
```
when there are multiple splits going into cat.
An alternative is to use stack rather than unsqueeze, leading to patterns like:
```
split -> stack -> cat
```
This is much better, as repeated applications of the same pattern will further simplify "split->stack", which is not trivial in the case of "split->unsqueeze->cat".
Another nice side effect is a smaller number of nodes in the graph overall.
Differential Revision: [D45952452](https://our.internmc.facebook.com/intern/diff/D45952452/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101766
Approved by: https://github.com/jansel
Needs https://github.com/microsoft/onnxscript/pull/721
The current FX exporter uses a manually maintained dictionary to map an ATen op to its OnnxFunction. However, an issue arises when an ATen op has overloads or an OnnxFunction has overloads, which is not resolvable by a one-to-one mapping. For example, `aten::arange` has the overloads `aten::arange.start` and `aten::arange.start_step`, and for `aten::argmax`, torchlib provides two functions: aten_argmax and aten_argmax_dim.
This PR utilizes newly introduced [ONNX OpSchema](https://github.com/microsoft/onnxscript/pull/626) to match the input arguments of an ATen operator to find the correct overload.
### OnnxRegistry
Heavily references the [TorchScript Registry](https://github.com/pytorch/pytorch/pull/84382). The only difference is that in the FX registry, an ATen operator with a specific opset version is mapped to a list of overloaded functions.
* No longer use global registry. The registry is initialized in `ResolvedExportOptions` with torchlib, and will be exposed to users in the future.
* The multiple-opset-version layer is kept through `_SymbolicFunctionGroup`, but torchlib currently only supports 18.
* Basic APIs for custom operator support (`register`, `unregister`, and `is_register_op`) are kept for future development. To further complete them, follow-up PRs should address:
- How to allow users to remove/override specific overload? Using OpSchema to differentiate?
- What happens when a user registers a new overload with the same OpSchema as one of the registered overloads.
### OnnxDispatcher
Dispatch ATen operators to the matched overload by comparing OpSchema with input arguments.
* `OpSchemaWrapper` wraps the ONNX schema and records the matching score.
* `dispatch` uses `OpSchemaWrapper` to compare data types to find the best-matched overload. If the match isn't perfect, a warning is recorded in diagnostics.
* `dispatch_opset_version` is referenced from #84382 and kept, but torchlib doesn't support opset versions other than 18.
* Because right now (1) OnnxFunction arguments are manually typed, and (2) ORT may not follow the ONNX type spec, we relax the schema match with a `matching score system`.
* To include more support, follow-up PRs should address:
- How to add op.Cast with autocast? In torchlib or converter?
- The need for type promotion can be captured by the dispatcher, but it requires OpSchema to expose the T1/T2 information.
### OpSchemaWrapper - Matching Score Mechanism
#### The matching score system:
This is a temporary solution for targeting the correct ONNX overload, given that we only have manually annotated arguments (potentially inaccurate schemas) and limited support for AttributeProto.
1. Perfect match exam: if all arguments/kwargs are matched, return the function without any warnings.
2. Best match exam: the system adds up the correctly matched inputs in order and subtracts the symmetric difference between their attributes to calculate the matching score, then selects the one with the highest score in the end. If the selection is not a perfect match, a warning message is sent to SARIF. A rough sketch of this scoring follows.
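A rough Python sketch of this scoring (simplified; the set-of-allowed-dtypes and attribute-name representations are assumptions, not the actual `OpSchemaWrapper` code):
```python
def match_score(expected_input_dtypes, input_dtypes, expected_attrs, given_attrs):
    # +1 for every input whose dtype is allowed by the schema ...
    score = sum(
        1 for allowed, got in zip(expected_input_dtypes, input_dtypes) if got in allowed
    )
    # ... minus the symmetric difference between expected and provided attributes.
    score -= len(set(expected_attrs) ^ set(given_attrs))
    return score

# Illustrative dtypes/attrs only, in the spirit of the argmax overloads below.
print(match_score([{"float32", "int64"}], ["float32"], {"dim", "keepdim"}, {"keepdim"}))  # 1 - 1 = 0
```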
#### Example of overloads
1. Different types: Caused by the difference between the ONNX spec and PyTorch.
The matching system finds the correct one.
```python
@torch_op("aten::mul")
def aten_mul(self: TReal, other: TReal) -> TReal:
...
@torch_op("aten::mul")
def aten_mul_bool(self: BOOL, other: BOOL) -> BOOL:
...
```
2. Optional dim: caused by unsupported op.OptionalHasElement (will be supported in opset version 20). dim could be "None".
```python
@torch_op("aten::argmax", trace_only=True)
def aten_argmax(
self: TrealOrUInt8, dim: Optional[int] = None, keepdim: bool = False
) -> TrealOrUInt8:
...
@torch_op("aten::argmax", private=True)
def _aten_argmax_dim(self: TrealOrUInt8, dim: int, keepdim: bool = False) -> TrealOrUInt8:
...
```
This case is impossible to differentiate, as both might have dim in kwargs, so please make sure you turn the one with `dim: int` into a private function.
3. Optional dtype: dtype could be "unprovided". The difference from 2 is that dtype would not be None.
```python
@torch_op("aten::new_full")
def aten_new_full(self: TTensor, size: INT64, fill_value: TTensor) -> TTensor:
...
@torch_op("aten::new_full")
def aten_new_full_dtype(self: TTensor, size: INT64, fill_value: TTensor, dtype: int) -> TTensor:
...
```
Depending on whether dtype is provided, the matching system dispatches the ATen op to the correct overload.
4. `None`, `[]`, and `NoneType` are considered to fail the match.
5. When two functions have the same score, it is recorded in SARIF.
### TODOs
1. Type promotion can be captured by the dispatcher only if OpSchema can provide it. However, the implementation of a "graph-level" pass vs. "in-op" promotion can be further discussed in https://github.com/microsoft/onnxscript/issues/563.
2. torchlib should provide the "opset version" to OnnxRegistry.
3. How to expose OnnxRegistry with custom add/remove op APIs needs to be further discussed.
Co-authored-by: Justin Chu <justinchuby@microsoft.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100660
Approved by: https://github.com/thiagocrepaldi
It seems that some legacy default stream logic (e.g., present in a8ff647e42/torch/utils/dlpack.py (L114) ) is not handled on the potential receiving end in `torch/_tensor.py`.
Open to suggestions on how to make the test case less clunky, as this was the combination we arrived at after discovering flakiness in alternate versions.
Thanks to Olga Andreeva for surfacing this issue and providing a repro.
CC @Aidyn-A @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101318
Approved by: https://github.com/ngimel
Summary: These two are CPU-only tests.
Test Plan:
```
buck2 test @//mode/dev-nosan //caffe2/test/inductor:test_inductor -- --exact 'caffe2/test/inductor:test_inductor - test_in_out_buffer_cuda (caffe2.test.inductor.test_torchinductor.CudaTests)' --run-disabled
```
Reviewed By: bertmaher
Differential Revision: D46011571
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101887
Approved by: https://github.com/bertmaher
It looks like inference_mode wasn't playing well with functionalization.
If you run torch.compile on a function, and the inputs to the function are tensors created outside of inference mode, then we need to make sure that when we created functional tensor wrappers for those inputs during compilation, those functional wrappers properly mirror whether or not the original tensor is an inference tensor.
Hopefully fixes https://github.com/pytorch/pytorch/issues/101151
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101219
Approved by: https://github.com/albanD, https://github.com/ezyang
Sleef has automatic architecture selection for Power, so there is no need to call architecture-specific interfaces. If we call the generic interface, Sleef will correctly choose the architecture-specific code based on the architecture (vsx for Power8, vsx3 for Power9 and Power10). So the vsx suffixes in the Sleef calls in PyTorch are removed, and the architecture-specific code selection is handled by Sleef internally.
This fixes the issue wherein older (and slower) vsx code in Sleef was getting executed on newer Power9 and Power10 processors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100149
Approved by: https://github.com/jgong5
This PR:
- adds a mechanism to turn any RAII guard into a Python Context Manager
- turns ExcludeDispatchKeyGuard into a context manager, and purges usages
of the older torch._C.ExcludeDispatchKeyGuard from the codebase.
The mechanism is that given a RAII guard, we construct a context
manager object that holds an optional guard. When we enter the context
manager we populate the guard, when we exit we reset it.
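For illustration, a minimal Python-level sketch of the mechanism (`guard_factory` is a hypothetical callable that constructs the underlying RAII guard; the real implementation wraps the C++ guard):
```python
class _GuardContextManager:
    def __init__(self, guard_factory):
        self._guard_factory = guard_factory  # constructs the RAII guard
        self._guard = None                   # the "optional guard": empty until entered

    def __enter__(self):
        self._guard = self._guard_factory()  # populate the guard on entry
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self._guard = None                   # reset (destroy) the guard on exit
        return False
```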
We don't delete torch._C.ExcludeDispatchKeyGuard for BC reasons (people
are using it in fbcode). If this code actually sticks
(it is using C++17 and that worries me a bit), then I'll apply the
change to other RAII guards we have, otherwise, we can write our own
std::apply.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102037
Approved by: https://github.com/ezyang, https://github.com/bdhirsh
There are many communication operations for ShardedTensor in the FSDP state dict. They use the externally passed-in pg (or the default pg), which currently supports CUDA devices. Before communication, the memory is moved to CUDA, which is implicit (because it is essentially moving data to the memory type required by the pg, not the computing device type). Similarly, when users use FSDP on a custom backend, they pass in a custom pg (which does not support CUDA devices), which may cause FSDP to not work properly in some cases. This PR obtains the memory type supported by the pg through _get_pg_default_device during communication and moves the data to it when needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101533
Approved by: https://github.com/awgu
1. Record time spent for init_process_group, new_group, _store_based_barrier
2. Rename c10d_error_logger to c10d_logger for generalization.
3. Refactor to move logger wrappers in distributed_c10d.py to logger to c10d_logger.py.
4. Rename the logger wrappers (BC-breaking): exception_handler is renamed to exception_logger to avoid confusion with a logging handler.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101912
Approved by: https://github.com/fduwjj
Bumps [mpmath](https://github.com/fredrik-johansson/mpmath) from 1.2.1 to 1.3.0.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a href="https://github.com/fredrik-johansson/mpmath/releases">mpmath's releases</a>.</em></p>
<blockquote>
<h2>1.3.0</h2>
<p>Security issues:</p>
<ul>
<li>Fixed ReDOS vulnerability in mpmathify() (CVE-2021-29063) (Vinzent Steinberg)</li>
</ul>
<p>Features:</p>
<ul>
<li>Added quadsubdiv() for numerical integration with adaptive path splitting
(Fredrik Johansson)</li>
<li>Added the Cohen algorithm for inverse Laplace transforms
(Guillermo Navas-Palencia)</li>
<li>Some speedup of matrix multiplication (Fredrik Johansson)</li>
<li>Optimizations to Carlson elliptic integrals (Paul Masson)</li>
<li>Added signal functions (squarew(), trianglew(), sawtoothw(), unit_triangle()
sigmoidw()) (Nike Dattani, Deyan Mihaylov, Tina Yu)</li>
</ul>
<p>Bug fixes:</p>
<ul>
<li>Correct mpf initialization from tuple for finf and fninf (Sergey B Kirpichev)</li>
<li>Support QR decomposition for matrices of width 0 and 1 (Clemens Hofreither)</li>
<li>Fixed some cases where elliprj() gave inaccurate results (Fredrik Johansson)</li>
<li>Fixed cases where digamma() hangs for complex input (Fredrik Johansson)</li>
<li>Fixed cases of polylog() with integer-valued parameter with complex type
(Fredrik Johansson)</li>
<li>Fixed fp.nsum() with Euler-Maclaurin algorithm (Fredrik Johansson)</li>
</ul>
<p>Maintenance:</p>
<ul>
<li>Dropped support for Python 3.4 (Sergey B Kirpichev)</li>
<li>Documentation cleanup (Sergey B Kirpichev)</li>
<li>Removed obsolete files (Sergey B Kirpichev)</li>
<li>Added options to runtests.py to skip tests and exit on failure
(Jonathan Warner)</li>
</ul>
</blockquote>
</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102058
Approved by: https://github.com/huydhn
Per title. I extracted this part from the draft PR that I'm working on, https://github.com/pytorch/pytorch/pull/102107, because
the remaining issues with rerunning disabled tests (log size and unexpected runner failures) require further investigation, while this one is clearly breaking trunk at the moment.
Until we can support disabling C++ tests, there is no need to run them in rerun-disabled-tests mode.
### Testing
Coming from https://github.com/pytorch/pytorch/pull/102107, for example https://github.com/pytorch/pytorch/actions/runs/5062224659/jobs/9087747981
```
2023-05-23T22:46:50.1953318Z Running cpp/basic 1/1 ... [2023-05-23 22:46:50.195077]
2023-05-23T22:46:50.1953847Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode
2023-05-23T22:46:50.2066032Z Running cpp/atest 1/1 ... [2023-05-23 22:46:50.206348]
2023-05-23T22:46:50.2066435Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode
2023-05-23T22:46:52.2666743Z No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
2023-05-23T22:46:52.2691817Z Ignoring disabled issues: []
...
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102132
Approved by: https://github.com/clee2000
When investigating failures in https://github.com/pytorch/pytorch/pull/100017 I realized that we were reentering FakeTensorMode even though there was already one on the stack. Although we have attempted assert for these cases in the past, e.g., as in https://github.com/pytorch/pytorch/pull/97186 it seems that the existing protections were insufficient.
In this particular case, the reapplication of FakeTensorMode was due to an interaction with NotImplemented multiple dispatch handling. If proxy tensor mode detects an unrecognized tensor type (this includes FakeTensor, if it is not tracked with a proxy), it will return NotImplemented to give this tensor a chance to unpack itself into proxyable operation. However, this is never the right thing for FakeTensor, where no unpacking is possible. However, today, FakeTensor attempts to reapply the FakeTensorMode, resulting in FakeTensorMode being twice on the stack.
This PR does a number of things:
* It adds an assert in `FakeTensorMode.__torch_dispatch__` that you must not already have this mode on the stack; this is ALWAYS an error
* It modifies `FakeTensor.__torch_dispatch__` to return `NotImplemented` if the mode is already active. This prevents us from re-adding the mode to the stack
* It adds a new logging artifact `not_implemented` which you can use to get debug logs about all of the times a `__torch_dispatch__` handler returned NotImplemented and why it did so. Your subclass has to manually opt into this logging, but I inserted the necessary logs for ProxyTensorMode and FakeTensor(Mode)
* `with fake_mode` now no-ops if the fake mode is already on the stack, which is what users want anyway
* I am BREAKING pre-autograd tracing, because it is currently doing something weird with the original C++ mode stack. Brian is going to follow up with a fix next week.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102091
Approved by: https://github.com/thiagocrepaldi, https://github.com/eellison, https://github.com/wanchaol, https://github.com/bdhirsh
Move tf32_on_and_off after @torch.backends.cudnn.flags(enabled=True, benchmark=False), because the cudnn flags decorator overwrites tf32_on_and_off when it is applied afterwards.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102007
Approved by: https://github.com/ngimel
While attempting to explore XLTransformers w/ PT2, I found that we leak tracing time objects (VariableTrackers) into the runtime:
```
Traceback (most recent call last):
File "/scratch/voz/work/xlformers/train.py", line 686, in <module>
main(cfg)
File "/scratch/voz/work/xlformers/train.py", line 357, in main
pred, _ = model(x)
File "/scratch/voz/work/pytorch/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/scratch/voz/work/pytorch/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/scratch/voz/work/pytorch/torch/_dynamo/eval_frame.py", line 282, in _fn
return fn(*args, **kwargs)
File "/data/home/voz/miniconda3/envs/torch5/lib/python3.10/site-packages/fairscale/nn/data_parallel/fully_sharded_data_parallel.py", line 1416, in forward
self._lazy_init()
File "/data/home/voz/miniconda3/envs/torch5/lib/python3.10/site-packages/fairscale/nn/data_parallel/fully_sharded_data_parallel.py", line 1424, in <resume in forward>
args, kwargs = cast_floats_to_right_precision(True, True, *args, **kwargs)
File "/data/home/voz/miniconda3/envs/torch5/lib/python3.10/site-packages/fairscale/nn/data_parallel/fully_sharded_data_parallel.py", line 1434, in <resume in forward>
self._rebuild_full_params()
File "/scratch/voz/work/pytorch/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/data/home/voz/miniconda3/envs/torch5/lib/python3.10/site-packages/fairscale/nn/data_parallel/fully_sharded_data_parallel.py", line 1932, in _rebuild_full_params
def update_p_data(custom_output_tensor: Optional[torch.Tensor] = None) -> None:
File "/data/home/voz/miniconda3/envs/torch5/lib/python3.10/site-packages/fairscale/nn/data_parallel/fully_sharded_data_parallel.py", line 1932, in <resume in _rebuild_full_params>
def update_p_data(custom_output_tensor: Optional[torch.Tensor] = None) -> None:
File "/scratch/voz/work/pytorch/torch/cuda/__init__.py", line 464, in __enter__
if self.src_prev_stream.device != cur_stream.device:
AttributeError: 'CUDAStreamVariable' object has no attribute 'device'
```
This indicates a serious bug.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100766
Approved by: https://github.com/ezyang
This PR addresses #101690 by implementing a faster data element (byte) swap in `_StorageBase` using C++ rather than Python.
This helps the case where a large model saved on a little-endian machine is loaded on a big-endian machine.
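For illustration, here is a rough pure-Python sketch of the per-element byte swap (the actual change implements this in C++ on the raw storage; this snippet is not the new implementation):
```python
import torch

def byteswap_elements(t: torch.Tensor) -> torch.Tensor:
    elem = t.element_size()
    raw = bytearray(t.numpy().tobytes())
    for i in range(0, len(raw), elem):
        raw[i:i + elem] = raw[i:i + elem][::-1]  # reverse the bytes of each element
    return torch.frombuffer(raw, dtype=t.dtype).reshape(t.shape)

print(byteswap_elements(torch.tensor([1], dtype=torch.int32)))  # tensor([16777216], dtype=torch.int32)
```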
TODO:
- [x] Add test cases
- [x] Add performance comparison before and after the PR
- [ ] (Optional) Investigate further opportunities for performance improvements by [SIMDization](https://dev.to/wunk/fast-array-reversal-with-simd-j3p)
Fixes #101690
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101925
Approved by: https://github.com/mikaylagawarecki
Summary:
In this PR we aligned with the design of the annotation API and use the quantization spec directly for annotation.
The main change is in prepare: we consume the quantization_spec object directly instead of the observer or fake-quant constructor; the constructor is created
inside prepare, so annotation API users only need to interact with the quantization spec object after this PR.
Test Plan:
```
buck2 test mode/opt caffe2/test:quantization_pt2e -- --exact 'caffe2/test:quantization_pt2e - test_resnet18_with_quantizer_api (quantization.pt2e.test_quantize_pt2e.TestQuantizePT2EModels)'
```
Reviewed By: kimishpatel
Differential Revision: D45934088
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102054
Approved by: https://github.com/kimishpatel
Summary:
serialization_id was added in a previous change to be written as a random GUID associated with each call to save a module, for the purpose of adding tracking for saved artifacts. In order not to disturb existing systems that rely on the serialized bytes being deterministic when serializing the same module, this change uses a combined hash of the uncompressed content and file names instead of a GUID for the serialization id.
The use of this hashing reuses the same CRC32 that is already calculated for zip writing, so it doesn't incur additional computational overhead.
Data descriptor is one of the file headers inside the zip format https://en.wikipedia.org/wiki/ZIP_(file_format)#Data_descriptor. It contains the CRC32 of the uncompressed data. By inspecting the written data in PyTorchStreamWriter, the CRC32 is found for each written record.
In order to make serialization_id a unique and deterministic id for the
serialized files without computation overhead, the updated `serialization_id` is computed based on all files written, and is composed of:
1) a combined hash of record name hashes
2) a combined crc32 of the record uncompressed data
Example value: "15656915541136177431866432772"
Test Plan: buck2 test @//mode/dev //caffe2/caffe2/serialize:inline_container_test
Differential Revision: D46038973
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101964
Approved by: https://github.com/davidberard98
Fixed type hints for CosineAnnealingWarmRestarts:
- `T_mult` is not `Optional[int]` but just `int`
- `eta_min` is not `Optional[float]` but just `float`
- removed `step` method specific annotation as it is compatible with the base class
e132f09e88/torch/optim/lr_scheduler.py (L1365-L1375)
Otherwise, computation like this `self.T_i * self.T_mult` in `self.step` is not possible:
```
error: Unsupported operand types for * ("int" and "None")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102067
Approved by: https://github.com/janeyx99
This PR adds an explicit API for registering a backward formula for a
CustomOp. In the end state, we will likely have this explicit API and a
magic API (which is sugar on top of an explicit API), since different
parties of users prefer different ones.
Concretely, to define a backward formula for a CustomOp:
- a user must provide us a "save for backward" function that accepts
(inputs, output) and returns exactly what they want saved for backward
- a user must provide us a "backward" function that accepts
(ctx, saved, *grads) and returns us the grad_inputs. The grad_inputs
are returned as a dict mapping str to a gradient.
Please see the changes in custom_op_db.py for examples of the API.
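For illustration, a hedged sketch of the shapes of these two callbacks for a hypothetical op `mysin(x)` (the attribute-style access on `inputs`/`saved` and the single `grad_output` are assumptions based on the description above; see custom_op_db.py for the real usage):
```python
import torch

# Hypothetical custom op: mysin(x: Tensor) -> Tensor

def save_for_backward_fn(inputs, output):
    # Return exactly what backward needs saved; here, just the input tensor x.
    return inputs.x

def backward_fn(ctx, saved, grad_output):
    # grad_inputs are returned as a dict mapping input name -> gradient.
    return {"x": grad_output * saved.cos()}
```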
There are a number of pieces to this PR and I'm happy to split it if it
helps. They are:
- The actual APIs for specifying the two functions
(impl_save_for_backward, impl_backward)
- The autograd kernel: we take the functions the user give us and
construct an autograd.Function object that we then register to
the Autograd dispatch key
- Indirection for the autograd kernel. We add a layer of indirection so
that one can swap out the autograd kernel. This is necessary because by
default, we register an "autograd not implemented" kernel as the
Autograd implementation but then swap it for the actual kernel when the
user provides it.
Test Plan:
- We apply this API to give backward formulas for things in
custom_op_db. We then hook up custom_op_db to the Autograd OpInfo tests.
- Various tests in test_python_dispatch.py to check error cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101824
Approved by: https://github.com/ezyang
torch/custom_op.py is getting long, and the autograd pieces are going to
make it even longer. I'm planning on just organizing the files under
a torch/_custom_op folder.
Note that the imports now look a bit crazy (from torch._custom_op.impl
import...) but they will look more OK when we figure out the plan to
make custom_op public (coming later).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101823
Approved by: https://github.com/ezyang, https://github.com/albanD, https://github.com/bdhirsh
Summary:
Otherwise we get
```
Traceback (most recent call last):
File "<string>", line 49, in <module>
File "<string>", line 47, in __run
File "/usr/local/fbcode/platform010/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/local/fbcode/platform010/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/data/users/jongsoo/fbsource/buck-out/v2/gen/fbcode/ef4169ac7f95fb74/caffe2/benchmarks/transformer/__sdp__/sdp#link-tree/caffe2/benchmarks/transformer/sdp.py", line 346, in <module>
main(save_path)
File "/data/users/jongsoo/fbsource/buck-out/v2/gen/fbcode/ef4169ac7f95fb74/caffe2/benchmarks/transformer/__sdp__/sdp#link-tree/caffe2/benchmarks/transformer/sdp.py", line 328, in main
experiment = run_single_experiment(experiment_config)
File "/data/users/jongsoo/fbsource/buck-out/v2/gen/fbcode/ef4169ac7f95fb74/caffe2/benchmarks/transformer/__sdp__/sdp#link-tree/caffe2/benchmarks/transformer/sdp.py", line 229, in run_single_experiment
assert_close_tensors(nn_mha_output, composite_mha_output)
File "/data/users/jongsoo/fbsource/buck-out/v2/gen/fbcode/ef4169ac7f95fb74/caffe2/benchmarks/transformer/__sdp__/sdp#link-tree/caffe2/benchmarks/transformer/sdp.py", line 196, in assert_close_tensors
assert torch.allclose(a, b, atol=1e-3, rtol=1e-3)
AssertionError
```
Test Plan: buck run mode/dev-nosan //caffe2/benchmarks/transformer:sdp
Differential Revision: D45843836
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101965
Approved by: https://github.com/drisspg
Summary:
This is the second refactor to align the annotation API with design,
next step is to change prepare_pt2e to consume QuantizationSpec object directly
Test Plan:
```
buck2 test mode/opt caffe2/test:quantization_pt2e -- --exact 'caffe2/test:quantization_pt2e - test_resnet18_with_quantizer_api (quantization.pt2e.test_quantize_pt2e.TestQuantizePT2EModels)'
```
Reviewed By: kimishpatel
Differential Revision: D45927416
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101920
Approved by: https://github.com/andrewor14
It's not mentioned in `__all__`, so move `import torch.backends.opt_einsum as opt_einsum` into the `einsum` function to delay the `torch.backends` import and hide it completely from the module-level scope.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102004
Approved by: https://github.com/janeyx99
Summary: Change the placeholder check from a singleton comparison to an isinstance check against PHBase, so you can create your own PH class with metadata.
Test Plan: added unit test
Reviewed By: joshuadeng
Differential Revision: D46085128
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102008
Approved by: https://github.com/PaliC
D45936056 was hitting bizarre failures running unit tests under FB's
test runner, where we'd see things like:
```
9 TESTS FAILED
✗ caffe2/test/inductor:fused_attention - <locals> (unittest.loader._FailedTest)
```
The reason for this is that the test runner uses a two-step process:
it first lists the tests in one process, and then runs them, using the
names from the listing step, in separate processes.
But since we're decorating the class, it ends up getting listed with a weird name
like `torch._dynamo.config_utils.ContextDecorator.__call__.<locals>._TestCase`,
and when the runner tries to load that module, it fails.
So one solution (other than, you know, using pytest) is to update the
__qualname__ and __module__ of the _TestCase wrapper so that the runner will
actually load the right module.
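For illustration, a small sketch of the fix (the decorator name `make_test_case` is hypothetical; the real change lives in the dynamo config context-manager code):
```python
def make_test_case(original_cls):
    class _TestCase(original_cls):
        pass

    # Make the wrapper impersonate the wrapped class so a list-then-load test
    # runner can resolve the listed name back to the right module.
    _TestCase.__name__ = original_cls.__name__
    _TestCase.__qualname__ = original_cls.__qualname__
    _TestCase.__module__ = original_cls.__module__
    return _TestCase
```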
@build[pytorch_dynamo_inductor]
Differential Revision: [D46044467](https://our.internmc.facebook.com/intern/diff/D46044467/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101975
Approved by: https://github.com/xuzhao9, https://github.com/jansel
`__del__` is a bit difficult to use, because when it is called, it is
not guaranteed that anything it uses has not been cleaned up.
Ed tells me he got the following exception one day, which is what
prompted this PR.
```
Exception ignored in: <function Library.__del__ at 0x7fa36d211e50>
Traceback (most recent call last):
File "/data/users/ezyang/a/pytorch/torch/library.py", line 139, in
__del__
AttributeError: 'NoneType' object has no attribute 'remove'
```
One solution is to use weakref.finalize, which lets one define a
function to be run when the object is deleted that can hold references
to specific things it needs.
Another solution is to just check if the object is None, but I like the
weakref solution better.
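For illustration, a minimal sketch of the pattern (not the actual torch.library code; the handle objects with a `.remove()` method are stand-ins for the registrations being cleaned up):
```python
import weakref

class Library:
    def __init__(self):
        self._registration_handles = []
        # The finalizer captures only the handle list, so cleanup does not
        # depend on attributes of self that may already be torn down.
        self._finalizer = weakref.finalize(
            self, Library._cleanup, self._registration_handles
        )

    @staticmethod
    def _cleanup(handles):
        for handle in handles:
            handle.remove()
```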
Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101829
Approved by: https://github.com/ezyang
Fixes#101862
No more type errors and improved return type value:
```python
import torch
from torch import nn
t = torch.tensor([1, 2, 3], dtype=torch.float32)
t2 = torch.Tensor._make_subclass( # OK
nn.Parameter,
t.data,
)
reveal_type(t2) # Type of "t2" is "Parameter"
t3 = t._make_subclass( # OK
nn.Parameter,
t.data,
)
reveal_type(t3) # Type of "t3" is "Parameter"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101961
Approved by: https://github.com/albanD
Summary:
Quantizing a *gradient* is not applicable to a complex ASR model.
Gradient in INT8
f438266519
Gradient in FP32
f438109197
The two WERs clearly show the limitation of quantizing the gradient.
As of now, we are okay with simply enabling quantized backpropagation but computing the gradient in FP32.
It already saves memory due to the reduced model size.
Test Plan: Signals
Differential Revision: D45965552
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101739
Approved by: https://github.com/izaitsevfb
This wraps `ops` into an `OpsWrapper` object which wraps any returned
IR values into an `OpsValue` instance. This allows magic methods to
be implemented and means lowerings can write mathematical expressions much more
fluently. So instead of
```python
ops.add(ops.mul(ops.mul(ops.sub(ops.mul(_Ap2, x), _Ap3), x), x), _1)
```
we can write
```python
(_Ap2 * x - _Ap3) * x * x + _1
```
And it will translate to the equivalent `ops` calls.
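For illustration, a simplified sketch of the wrapping idea (not the actual inductor `OpsWrapper`/`OpsValue` classes; `ops` is assumed to be the handler object whose `add`/`sub`/`mul` methods build IR):
```python
class OpsValue:
    """Wraps an IR value so Python operators forward to the underlying ops.* calls."""

    def __init__(self, value, ops):
        self.value = value
        self.ops = ops

    def _unwrap(self, other):
        return other.value if isinstance(other, OpsValue) else other

    def __add__(self, other):
        return OpsValue(self.ops.add(self.value, self._unwrap(other)), self.ops)

    def __sub__(self, other):
        return OpsValue(self.ops.sub(self.value, self._unwrap(other)), self.ops)

    def __mul__(self, other):
        return OpsValue(self.ops.mul(self.value, self._unwrap(other)), self.ops)
```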
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101076
Approved by: https://github.com/lezcano, https://github.com/ngimel
Summary:
This diff adds QuantizationAnnotation and also refactors the existing annotation to use this object
```
@dataclass
class QuantizationAnnotation:
    # How some input nodes should be quantized, expressed as QuantizationSpec
    # a map from torch.fx.Node to QuantizationSpec
    input_qspec_map: Dict[Node, QuantizationSpec]
    # How the output of this node is quantized, expressed as QuantizationSpec
    output_qspec: QuantizationSpec

class QuantizationSpec:
    dtype: torch.dtype
    is_dynamic: bool = False
    quant_min: Optional[int] = None
    quant_max: Optional[int] = None
    qscheme: Optional[torch.qscheme] = None
    ch_axis: Optional[int] = None
    # TODO: follow up PR will add this
    # Kind of observer such as MinMaxObserver, PerChannelHistogramObserver etc.
    # observer_or_fake_quant_type: Union[ObserverBase, FakeQuantizeBase]
```
Example after full refactor:
```
int8_qspec = QuantizationSpec(dtype=torch.int8, ...)
weight_qspec = QuantizationSpec(dtype=torch.int8, ...)
conv_node["quantization_annotation"] = QuantizationAnnotation(
    input_qspec_map={input_node: int8_qspec, weight_node: weight_qspec},
    output_qspec=int8_qspec,
)
```
Note: right now input_qspec_map and output_qspec map are still using observer and fake quant constructors.
Follow up PR: change the input_qspec_map and output_qspec to use QuantizationSpec directly
Test Plan:
```
buck2 test mode/opt caffe2/test:quantization_pt2e -- --exact 'caffe2/test:quantization_pt2e - test_resnet18_with_quantizer_api (quantization.pt2e.test_quantize_pt2e.TestQuantizePT2EModels)'
```
Differential Revision: D45895027
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101708
Approved by: https://github.com/andrewor14
Summary:
3.10 doesn't have support for generic NamedTuples, but it exists in future versions, so typing_extensions supports it
(Note: this ignores all push blocking failures!)
Test Plan: sandcastle
Reviewed By: itamaro
Differential Revision: D45923201
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101830
Approved by: https://github.com/izaitsevfb
Summary: CUDA graphs don't work with CUDA 11's CUPTI lazy re-init, so we'll turn it off if any module turns on cudagraphs
Test Plan: test with cuda graph on
Reviewed By: aaronenyeshi
Differential Revision: D45967197
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101848
Approved by: https://github.com/aaronenyeshi
Summary: Since CUPTI lazy re-init crashes with CUDA Graphs in CUDA 11, we should disable this. Remove this item once majority of workloads move to CUDA 12.
Test Plan: CI Tests
Reviewed By: xw285cornell
Differential Revision: D45921028
Pulled By: aaronenyeshi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101879
Approved by: https://github.com/xw285cornell
Adds sdpa patterns seen in HF models.
To actually make the patterns match, we need constant folding to remove the addition of the all-zeros mask, and we need to figure out what to do with low-memory dropout.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100609
Approved by: https://github.com/jansel
The memory compression for these models is at parity, but because we interleave timings between torch.compile and eager runs, memory is duplicated between the eager and cudagraphs pools and causes OOM.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101837
Approved by: https://github.com/anijain2305
Fixes https://github.com/pytorch/pytorch/issues/100415
Results in the following error:
```
Traceback (most recent call last):
File "/scratch/angelayi/work/pytorch/test/export/test_export.py", line 572, in test_export_constrain_static
export(f, example_inputs, constraints)
File "/scratch/angelayi/work/pytorch/torch/_export/__init__.py", line 348, in export
method_name_to_graph_module[compile_spec.method_name] = _export(
File "/scratch/angelayi/work/pytorch/torch/_export/__init__.py", line 119, in _export
raise UserError(UserErrorType.CONSTRAIN_VIOLATION, str(e))
torch._dynamo.exc.UserError: File "/scratch/angelayi/work/pytorch/test/export/test_export.py", line 561, in f
constrain_as_value(c, min=1, max=3)
It appears that you're trying to set a constraint on a value which we evaluated to have a static value of 3. Scroll up to see where this constraint was set.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101655
Approved by: https://github.com/avikchaudhuri
FakeTensor has a default device logic that wraps meta tensors to the right device after running meta kernels and throws on multiple devices. This logic was only running on the wrapping from meta kernels -> fake. For out variants, where the output of the meta kernel was already a fake tensor because it was an input, the device logic wasn't running.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101807
Approved by: https://github.com/ngimel
For TIMM ```tf_mixnet_l``` on the CPU dynamic shape path, we always get a wrong result compared with eager mode; the root cause is that we compute a wrong index when doing vectorization:
```
for(long i2=static_cast<long>(0L); i2<static_cast<long>(16L*(((std::ceil((1.0/2.0)*(std::ceil((1.0/2.0)*(std::ceil((1.0/2.0)*(std::ceil((1.0/2.0)*ks1))))))))*(std::ceil((1.0/2.0)*(std::ceil((1.0/2.0)*(std::ceil((1.0/2.0)*(std::ceil((1.0/2.0)*ks1))))))))) / 16L)); i2+=static_cast<long>(16L))
```
The main loop's index uses ```/``` rather than ```//```. After this PR, the ```tf_mixnet_l``` accuracy test passes.
How to reproduce this issue?
```
python -m torch.backends.xeon.run_cpu --node_id 0 benchmarks/dynamo/timm_models.py --accuracy --float32 -dcpu --inference -n5 --inductor --dynamic-shapes --only tf_mixnet_l
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101793
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/ezyang
FSDP creates communication groups for intra-node communication through dist.new_subgroups. Previously, dist.new_subgroups only supported creation based on the number of CUDA devices. However, issue #99706 removed the availability check for CUDA devices, allowing custom backends to create groups based on the number of custom devices per node.
This PR allows FSDP to explicitly pass the number of devices within the node when creating communication groups for intra-node communication, instead of defaulting to the number of CUDA devices.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100622
Approved by: https://github.com/awgu
Fixes a compilation error on ppc64le resulting from the missing conversion functions 'convert_half_float' and 'convert_float_half'.
These functions are implemented by this commit.
Started failing compilation from the following commit onwards: ced5c89b6fbe827a538b7ada96b2f9a5989871c7.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100168
Approved by: https://github.com/jgong5, https://github.com/ezyang
Complete the implementation of the is_pinned() interface of the untyped storage class for privateuse1,
and refactor the typed-storage implementation to use untyped_storage.is_pinned().
Hi, @ezyang
This is another improvement of untyped storage for privateuse1, can you take a moment to review it? Thanks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100868
Approved by: https://github.com/kurtamohler, https://github.com/ezyang
Fixes #96604
## Issue description
When we use a constant tensor with the uint8 type, the kernel generated by torchinductor outputs wrong results. For example, the negative value of `5` in uint8 will be `255`, and it is `True` that `255` is larger than `5`. However, the output result is `False` when we compare `torch.neg(5)` with `5`. This is because torchinductor bypasses the data type for constant tensors, so the `5` here is taken as an int32; the comparison then happens between `-5` and `5`.
## Solution
This PR generates an extra conversion for a uint8 constant value when it is used; the conversion does not occur at the first assignment but at each access of the constant value.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101468
Approved by: https://github.com/desertfire, https://github.com/jansel
This PR adds support for tracing autograd.Function with grad.
A few important bullet points outlining our approach:
1) Our goal is to verify soundness in order to add a call_function to the autograd.Function's `apply` to the graph.
2) We achieve (1) by either verifying soundness or rejecting soundness, by ensuring that both forward and backward of the autograd.Function are sound.
3) For the forward, if we verify soundness, we install its guards into the graph.
4) For the backward, if we verify soundness, we throw it out. However, backwards soundness verification is more onerous, and has a config driven set of banned attrs and methods for tensors.
1-4 above are achieved by turning the forward and backward into UserDefinedFunctionVariables, and inlining through them, relying on dynamo's soundness detection. If we graph break in these, we raise and treat them as unsound. As noted above, backwards is stricter yet.
For the tracing, the safety comes from dynamo's HigherOrderOperator system. That system ensures that not only do we trace soundly, but that no new variables are lifted into inputs during the tracing, and that the forward and backwards are entirely self contained.
Whenever we reject a function as unsound, we restore back, as usual.
Due to some limitations in the lifting logic, we have an escape hatch we implemented for tensors that are known in forward, but cross into backwards through save_tensors (save) /saved_tensors (load). We escape hatch here to avoid having the known saved tensors coming from forward end up being accidentally treated as lifted variables (and rejected). This is sound, but a little hacky feeling.
Additionally, due to some limitations in fx node removal, combined with how we produce subgraphs for the traces installed from HigherOrderOperators, we had to improve our node removal logic. In the event of a restore, we remove the old nodes from the graph, as usual in dynamo. However, because the references to these nodes may exist in subgraphs, we traverse any nodes users and remove them first if and only if they are in another graph. This is always sound, because removal should only be downstream of restoration at this point.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99483
Approved by: https://github.com/zou3519
This pull request updates the codebase and the documentation to use C++17 instead of C++14 as the minimum required C++ standard. This affects the `ATen`, `c10`, and `torch` libraries and their dependencies, as well as the CI system and the `conda` package metadata.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100557
Approved by: https://github.com/malfet
If `astunparse` is not installed, following guard will be generated in `test_guard_function_builder_with_cse`:
```python
def ___make_guard_fn():
    def guard(L):
        if not (x[0].a < x[1].a * (3 - x[2].a)):
            return False
        if not (a.b.c[0].d.e + a.b.c[1].d.e * a.b.c[2].d.e > 0):
            return False
        if not (f(m.n[0], '0').x.y.z * f(m.n[0], '1').x.y.z * f(m.n[0], '2').x.y.z < 512):
            return False
        if not (self.g(a, b).k + (1 - self.g(a, b).k) <= m[0].a + self.g(a, b).k):
            return False
        return True
    return guard
```
Though, I have to say, hardcoding string comparison is pretty weird.
Also, skip `test_guards_cse_pass_[single|multiple]` if AST unparsing is missing.
Fixes failure in a test introduced by https://github.com/pytorch/pytorch/pull/98488
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101805
Approved by: https://github.com/atalman, https://github.com/ysiraichi
## TLDR
Fix decorator to re-enable 26+ distributed tests that were previously being skipped in CI
## Explanation
As part of the UCC upstream, we updated the backend test cases to also include "ucc".
3ed1569e86/torch/testing/_internal/common_distributed.py (L90-L92)
In distributed tests we use a decorator which reads from this config and makes sure all backends are available on the system.
3ed1569e86/torch/testing/_internal/distributed/distributed_test.py (L7131)
**However**, UCC is not enabled by default for a certain subset of CI tests, which causes the entire test to be skipped (even if the test is meant for nccl and the backend being tested is nccl).
As the fix, we should just check that the `BACKEND` being tested is available.
## Changes
- Change logic to only check if the current backend being used is available
- Rename `require_backends_available` -> `require_backend_is_available`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101704
Approved by: https://github.com/rohan-varma
Update release-related information. Features have become more complex, and the number of commits per release has increased a lot.
We had, on average:
2.5k commits for releases 1.1.0-1.7.0,
3-3.5k commits for releases 1.8.0-1.12.0,
4.5k-5k commits for releases 1.13.0 and 2.0.0.
Hence the current target is 3 releases a year.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101819
Approved by: https://github.com/svekars, https://github.com/malfet
Summary: We found that `_get_lstm_with_individually_observed_parts()` was missing a setup step that initializes the weights and biases of the LSTM layer. This diff fixes the observed numerical discrepancy seen by the CTRL team when using the above API.
Test Plan: N3358643
Differential Revision: D45821681
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101299
Approved by: https://github.com/andrewor14
# Summary
Since the initial upstream of memory-efficient attention from xformers (#86157), significant updates have been made to the kernel, including increased performance, bug fixes, and added functionality. This PR upstreams the latest version of this kernel as of version 0.0.20, or commit [6425fd0cacb1a6579aa2f0c4a570b737cb10e9c3](6425fd0cac)
## Future
Although this version of the kernel has support for dropout and arbitrary attention bias, I did not add this support to SDPA yet and left the guards in sdp_utils. Those will come in follow-up PRs, in order to reduce the scope creep of these substantial changes and ensure that nothing is broken.
## Specific Changes
### Minor Changes
* The build system work was done in the previous PR and so no changes were needed to CMAKE 🤞
* Adding the new files and re-arranging/creating folder structure
* Updating include paths
* Switching from xformer specific functions: `XFORMERS_CHECK -> TORCH_CHECK`
* Changes to xformer specific macros
* Updates to the `generate_kernels.py` to use account for Pytorch file structure, also added an arg parse that I could run on a test dir before creating the files in place.
### Bigger Changes
* Previous Kernel changes "Removed the chunk optimization: see discussion here: https://github.com/pytorch/pytorch/pull/96880"
* Increased the number of CUDA kernels -> potentially affecting the cuda_lib size.
* Preemptively made changes to the dtypes of seed and offset in order to allow for cuda_graphs: #100196 this is not finished.
* Made VERY BC-breaking changes to the at::_efficient_attention_forward and at::_efficient_attention_backward function signatures.
* I made these changes in part to allow this PR to land: https://github.com/pytorch/pytorch/pull/100196
### Due Diligence Checks:
* CUDA_lib size:
* Before: 496 MiB
* After: 496MiB
* Performance Sweep:
* I swept over 576 configs for forward-only inference, and the geomean speedup was 0.98x, with a min speedup of 0.84 and a max speedup of 1.2.
* For forward+backward, running on 270 configs (to reduce memory), the geomean speedup was 1.02x, with a min speedup of 1.02 and a max speedup of 1.35.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100583
Approved by: https://github.com/cpuhrsch
When I checked out the main branch and picked up #99872, I got the following link error. The root cause is that method definitions in the header file generate multiple instantiations of the same method signature.
This PR fixes the link error by avoiding generating multiple instantiations.
```
% python setup.py develop
...
[1080/1456] Linking CXX shared library lib/libtorch_cpu.so
FAILED: lib/libtorch_cpu.so
: && /usr/bin/c++ -fPIC -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility-inlines-hidden -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DLIBKINETO_NOROCTRACER -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor ...
...
/usr/bin/ld: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/cpu/AvgPoolKernel.cpp.ZVECTOR.cpp.o: in function `at::vec::ZVECTOR::vec_int_flt(int __vector(4))':
AvgPoolKernel.cpp.ZVECTOR.cpp:(.text+0xa520): multiple definition of `at::vec::ZVECTOR::vec_int_flt(int __vector(4))'; caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/UfuncCPUKernel_add.cpp.ZVECTOR.cpp.o:UfuncCPUKernel_add.cpp.ZVECTOR.cpp:(.text+0x16920): first defined here
/usr/bin/ld: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/cpu/AvgPoolKernel.cpp.ZVECTOR.cpp.o: in function `at::vec::ZVECTOR::vec_flt_int(float __vector(4))':
AvgPoolKernel.cpp.ZVECTOR.cpp:(.text+0xa5c0): multiple definition of `at::vec::ZVECTOR::vec_flt_int(float __vector(4))'; caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/UfuncCPUKernel_add.cpp.ZVECTOR.cpp.o:UfuncCPUKernel_add.cpp.ZVECTOR.cpp:(.text+0x169c0): first defined here
/usr/bin/ld: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/cpu/AdaptiveMaxPoolKernel.cpp.ZVECTOR.cpp.o: in function `at::vec::ZVECTOR::vec_int_flt(int __vector(4))':
AdaptiveMaxPoolKernel.cpp.ZVECTOR.cpp:(.text+0x5970): multiple definition of `at::vec::ZVECTOR::vec_int_flt(int __vector(4))'; caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/UfuncCPUKernel_add.cpp.ZVECTOR.cpp.o:UfuncCPUKernel_add.cpp.ZVECTOR.cpp:(.text+0x16920): first defined here
/usr/bin/ld: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/cpu/AdaptiveMaxPoolKernel.cpp.ZVECTOR.cpp.o: in function `at::vec::ZVECTOR::vec_flt_int(float __vector(4))':
AdaptiveMaxPoolKernel.cpp.ZVECTOR.cpp:(.text+0x5a10): multiple definition of `at::vec::ZVECTOR::vec_flt_int(float __vector(4))'; caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/UfuncCPUKernel_add.cpp.ZVECTOR.cpp.o:UfuncCPUKernel_add.cpp.ZVECTOR.cpp:(.text+0x169c0): first defined here
/usr/bin/ld: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/cpu/AdaptiveAvgPoolKernel.cpp.ZVECTOR.cpp.o: in function `at::vec::ZVECTOR::vec_int_flt(int __vector(4))':
AdaptiveAvgPoolKernel.cpp.ZVECTOR.cpp:(.text+0x7d90): multiple definition of `at::vec::ZVECTOR::vec_int_flt(int __vector(4))'; caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/UfuncCPUKernel_add.cpp.ZVECTOR.cpp.o:UfuncCPUKernel_add.cpp.ZVECTOR.cpp:(.text+0x16920): first defined here
/usr/bin/ld: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/cpu/AdaptiveAvgPoolKernel.cpp.ZVECTOR.cpp.o: in function `at::vec::ZVECTOR::vec_flt_int(float __vector(4))':
AdaptiveAvgPoolKernel.cpp.ZVECTOR.cpp:(.text+0x7e30): multiple definition of `at::vec::ZVECTOR::vec_flt_int(float __vector(4))'; caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/UfuncCPUKernel_add.cpp.ZVECTOR.cpp.o:UfuncCPUKernel_add.cpp.ZVECTOR.cpp:(.text+0x169c0): first defined here
/usr/bin/ld: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/cpu/Activation.cpp.ZVECTOR.cpp.o: in function `at::vec::ZVECTOR::vec_int_flt(int __vector(4))':
Activation.cpp.ZVECTOR.cpp:(.text+0x65840): multiple definition of `at::vec::ZVECTOR::vec_int_flt(int __vector(4))'; caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/UfuncCPUKernel_add.cpp.ZVECTOR.cpp.o:UfuncCPUKernel_add.cpp.ZVECTOR.cpp:(.text+0x16920): first defined here
/usr/bin/ld: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/cpu/Activation.cpp.ZVECTOR.cpp.o: in function `at::vec::ZVECTOR::vec_flt_int(float __vector(4))':
Activation.cpp.ZVECTOR.cpp:(.text+0x658e0): multiple definition of `at::vec::ZVECTOR::vec_flt_int(float __vector(4))'; caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/UfuncCPUKernel_add.cpp.ZVECTOR.cpp.o:UfuncCPUKernel_add.cpp.ZVECTOR.cpp:(.text+0x169c0): first defined here
collect2: error: ld returned 1 exit status
[67/316] Building CXX object test_api/CMakeFiles/test_api.dir/modules.cpp.o
ninja: build stopped: subcommand failed.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101000
Approved by: https://github.com/malfet
Torch's list of wrapping datasets has:
`TensorDataset`
`ConcatDataset`
`ChainDataset`
`TensorDataset` is useful for stacking sets of tensors but can't work with objects without a `.size()` method.
This PR proposes `StackDataset`, similar to `TensorDataset` but for the general case, like `ConcatDataset`.
Possible usages of `StackDataset` are multimodal networks with different inputs like image+text, or stacking non-tensor inputs together with a property to predict.
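A hedged sketch of the proposed behavior (assuming sample-wise zipping of equally sized datasets; see the PR for the actual implementation):
```python
import torch
from torch.utils.data import Dataset

class StackDataset(Dataset):
    """Zips same-length datasets sample-wise, without requiring .size()."""

    def __init__(self, *datasets):
        assert all(len(d) == len(datasets[0]) for d in datasets), "size mismatch"
        self.datasets = datasets

    def __getitem__(self, index):
        return tuple(d[index] for d in self.datasets)

    def __len__(self):
        return len(self.datasets[0])

images = [torch.rand(3, 8, 8) for _ in range(4)]
captions = ["a cat", "a dog", "a car", "a tree"]  # non-tensor samples are fine
pairs = StackDataset(images, captions)
print(pairs[1])  # (tensor of shape [3, 8, 8], 'a dog')
```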
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101338
Approved by: https://github.com/ejguan, https://github.com/NivekT
Adds sdpa patterns seen in HF models.
To actually make the patterns match, we need constant folding to remove the addition of an all-zeros mask, and we need to figure out what to do with low-memory dropout.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100609
Approved by: https://github.com/jansel
Prevent the error message from becoming a single column of characters
Thanks @clee200 for explaining how it worked before
### <samp>🤖 Generated by Copilot at fef1e25</samp>
> _`reject_reason` fixed_
> _Syntax error caused trouble_
> _Autumn of bugs ends_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101745
Approved by: https://github.com/kit1980, https://github.com/osalpekar
Fixes #100831, fixes #100878
Previously `gen_assert_indirect_indexing` was only called on the index
expressions passed to `ops.load` and `ops.store` which means if the
variable is optimized out during lowering, we never generate the
assert. This instead makes `ops.indirect_indexing` eagerly generate
the assert statement, whether or not it will be used.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100895
Approved by: https://github.com/lezcano, https://github.com/ngimel
This pass does a limited form of constant propagation, as well as propagation of
sympy indexing expressions. For example, say you have the function:
```python
def flip(x):
    i = torch.arange(x.size(0) - 1, -1, -1, device=x.device)
    return x[i]
```
On current main this results in indirect indexing:
```python
class buf0_loop_body:
    var_ranges = {z0: 4, z1: 3}
    index0 = 3 - z0
    index1 = 3*indirect0 + z1
    index2 = 3*z0 + z1
    def body(self, ops):
        get_index = self.get_index('index0')
        index_expr = ops.index_expr(get_index, torch.int64)
        set_indirect0 = self.set_indirect0(index_expr)
        get_index_1 = self.get_index('index1')
        load = ops.load('arg0_1', get_index_1)
        get_index_2 = self.get_index('index2')
        store = ops.store('buf0', get_index_2, load, None)
        return store
```
With this PR the indexing is propagated through the computation and into direct
indexing:
```python
class buf0_loop_body:
    var_ranges = {z0: 4, z1: 3}
    index0 = -3*z0 + z1 + 9
    index1 = 3*z0 + z1
    def body(self, ops):
        get_index = self.get_index('index0')
        load = ops.load('arg0_1', get_index)
        get_index_1 = self.get_index('index1')
        store = ops.store('buf0', get_index_1, load, None)
        return store
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101077
Approved by: https://github.com/lezcano, https://github.com/ngimel
Pinned hash updates are to be done by @pytorchupdatebot,
as mergebot token access is restricted to the environment.
### <samp>🤖 Generated by Copilot at d57c0f4</samp>
> _`UPDATEBOT_TOKEN`_
> _A new name for the night_
> _Autumn leaves falling_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101723
Approved by: https://github.com/huydhn
Not very elegant.
Checked on a separate conda env that doesn't have the usual CI dependencies.
The two pytest extensions at fault are pytest-rerunfailures and pytest-shard; also included pytest-flakefinder just in case.
No idea if this is a good way to do this.
Could also check individually and add flags based on that, but was told that requiring all the CI dependencies to be downloaded was also OK.
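A minimal sketch of the "check individually and add flags" alternative mentioned above (module and flag names are assumptions based on the plugins' documented options, not this PR's code):
```python
import importlib.util

# Only pass a plugin's flags when that plugin is actually importable.
extra_args = []
if importlib.util.find_spec("pytest_rerunfailures") is not None:
    extra_args += ["--reruns", "2"]
if importlib.util.find_spec("pytest_shard") is not None:
    extra_args += ["--shard-id", "0", "--num-shards", "1"]
```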
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100916
Approved by: https://github.com/huydhn
On Arm, I got
```
Traceback (most recent call last):
File "/opt/pytorch/pytorch/test/test_cuda.py", line 5260, in test_cpp_memory_snapshot_pickle
mem = run()
File "/opt/pytorch/pytorch/test/test_cuda.py", line 5257, in run
t = the_script_fn()
File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 496, in prof_func_call
return prof_callable(func_call, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 493, in prof_callable
return callable(*args, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
File "/opt/pytorch/pytorch/test/test_cuda.py", line 5254, in the_script_fn
@torch.jit.script
def the_script_fn():
return torch.rand(311, 411, device='cuda')
~~~~~~~~~~ <--- HERE
RuntimeError: record_context_cpp is not support on non-linux non-x86_64 platforms
```
dfe484a3b3/torch/csrc/profiler/unwind/unwind.cpp (L4-L24) seems related
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101366
Approved by: https://github.com/zdevito
Summary: This commit fixes a bug where we copy the metadata from
the wrong node after replace_pattern. This happened in the case
of [maxpool -> getitem1 -> conv -> bn -> getitem2], where
`getitem1` is the placeholder node fed into the fused conv + bn
pattern, and we incorrectly copied the metadata from `getitem1`
instead of from `getitem2`. We fix this bug by filtering out
the placeholder nodes before doing the metadata copying.
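A minimal sketch of the filtering idea (the helper name and structure are illustrative, not the exact code in this commit):
```python
import torch.fx as fx

def copy_pattern_metadata(matched_nodes: list, replacement_node: fx.Node) -> None:
    # Skip placeholder nodes (e.g. a `getitem1` feeding the pattern) so metadata
    # is copied from a node inside the matched pattern (e.g. the last getitem),
    # not from the pattern's input.
    non_placeholders = [n for n in matched_nodes if n.op != "placeholder"]
    if non_placeholders:
        replacement_node.meta = non_placeholders[-1].meta.copy()
```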
Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_prepare_qat_conv_bn_fusion_getitem_placeholder
Reviewers: jerryzh168, kimishpatel
Differential Revision: [D45916751](https://our.internmc.facebook.com/intern/diff/D45916751)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100941
Approved by: https://github.com/jerryzh168
Summary:
In order to better track models after serialization, this change writes a serialization_id as a UUID to inline container. Having this ID enables traceability of model in saving and loading events.
serialization_id is generated as a new UUID everytime serialization takes place. It can be thought of as a model snapshot identifier at the time of serialization.
Test Plan:
```
buck2 test @//mode/dev //caffe2/caffe2/serialize:inline_container_test
```
Local tests:
```
buck2 run @//mode/opt //scripts/atannous:example_pytorch_package
buck2 run @//mode/opt //scripts/atannous:example_pytorch
buck2 run @//mode/opt //scripts/atannous:example_pytorch_script
```
```
$ unzip -l output.pt
Archive: output.pt
Length Date Time Name
--------- ---------- ----- ----
36 00-00-1980 00:00 output/.data/serialization_id
358 00-00-1980 00:00 output/extra/producer_info.json
58 00-00-1980 00:00 output/data.pkl
261 00-00-1980 00:00 output/code/__torch__.py
326 00-00-1980 00:00 output/code/__torch__.py.debug_pkl
4 00-00-1980 00:00 output/constants.pkl
2 00-00-1980 00:00 output/version
--------- -------
1045 7 files
```
```
unzip -p output.pt "output/.data/serialization_id"
a9f903df-cbf6-40e3-8068-68086167ec60
```
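The same check can be done from Python with the standard library (a sketch; `output.pt` is the example archive from the test plan above):
```python
import zipfile

# Read the serialization_id back out of the saved archive, mirroring `unzip -p`.
with zipfile.ZipFile("output.pt") as archive:
    serialization_id = archive.read("output/.data/serialization_id").decode()
    print(serialization_id)
```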
Differential Revision: D45683657
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100994
Approved by: https://github.com/davidberard98
Summary: Post-refactoring, the previous diff had a drop in the QPS gained on a prod model because of multi-user getitems. Multi-user getitems can be handled by the replacer.
Differential Revision: D45893988
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101473
Approved by: https://github.com/jansel
With the TQDM changes in #100969, the model names ended up getting hidden from the benchmark printouts. We would print the model name with no newline, then tqdm would print a `\r` and overwrite the name of the running model.
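A sketch of the usual workaround for this class of problem (illustrative, not necessarily the exact change in this PR): route prints through `tqdm.write`, or finish the line before the progress bar starts, so the bar's `\r` cannot overwrite the text.
```python
from tqdm import tqdm

for model_name in tqdm(["resnet50", "bert_base"]):  # hypothetical model list
    tqdm.write(model_name)  # printed on its own line, unaffected by the bar
```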
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101627
Approved by: https://github.com/ezyang
The UB was:
- We grab a reference to the last element in the interpreter stack
(DynamicLayerStack)
- Then, we pop the last element in the interpreter stack
- Finally, we continue to use the reference to the last element.
The fix is to stop using that reference and instead use the popped
element.
Test Plan:
- It's difficult to write a test for this PR so I didn't
- Patched in https://github.com/pytorch/pytorch/pull/101409 and verified
that this PR fixes the bad_variant_access it was experiencing under
clang compilers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101568
Approved by: https://github.com/ezyang, https://github.com/Skylion007
Fixes #ISSUE_NUMBER
For the scenario where users inherit StorageImpl to implement their own subclasses, the current storage creation method cannot correctly create storage objects.
Following the registration approach used for Allocator, this extends the StorageImpl creation mechanism so that users can register their own custom StorageImpl creation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100237
Approved by: https://github.com/albanD
This is a purely cosmetic change where I organized the foreach ops in native_functions.yaml such that
1. all variants of each op are grouped together
2. add, sub, mul, div are first
3. every op after is alphabetical
This way, it's easier to see all the variants of an op, say add, in one screen. Items 2 and 3 are not strictly necessary but are simply a more organized scheme than not caring at all.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101583
Approved by: https://github.com/mlazos
After an investigation, running C++ tests with https://github.com/pytest-dev/pytest-cpp is just slower than running them directly, plain and simple. I'm curious about the exact root cause, but that's a story for another day.
`time build/bin/test_lazy` takes half a minute to run 610 tests on `linux-bionic-cuda11.8-py3.10-gcc7 / test (default, 2, 5, linux.4xlarge.nvidia.gpu)` while `time pytest /var/lib/jenkins/workspace/build/bin/test_lazy -v` takes 20+ minutes on the same runner. This is a very costly price to pay.
The saving grace here is that https://github.com/pytest-dev/pytest-cpp supports pytest-xdist to run tests in parallel with `-n auto`, so `time pytest /var/lib/jenkins/workspace/build/bin/test_lazy -v -n auto` takes only 3 minutes. This is still not as fast as running C++ tests directly, but it's an order of magnitude faster than running them sequentially.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101440
Approved by: https://github.com/clee2000
This PR accomplishes:
1) Enables retries for downloading torchbenchmark and Hugging Face models, in a similar manner to how we do it for timm models right now (see the retry sketch below).
2) Creates a `_download_model` function for the Hugging Face and TIMM runners, whose output I plan to use to preload the models somewhere if possible (please double check that I'll be saving the right thing). Instead of retries, we plan to just add torchbench to a Docker image as it is relatively small.
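A minimal retry helper along the lines of item 1 (the function and parameters here are illustrative, not the benchmark runner's actual code):
```python
import time

def download_with_retries(download_fn, retries=3, delay=5.0):
    """Call download_fn, retrying a few times with a fixed delay on failure."""
    for attempt in range(retries):
        try:
            return download_fn()
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(delay)
```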
### <samp>🤖 Generated by Copilot at 3361a4c</samp>
> _We're the brave and bold coders of the `common.py` module_
> _We've made a handy function for downloading models_
> _We've shared it with our mates in the other runners_
> _So pull and push and try again, we'll get them all in time_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101019
Approved by: https://github.com/huydhn, https://github.com/desertfire
Makes the CI prioritize running any test files that had a failing test in a previous iteration of the given PR.
A follow-up to https://github.com/pytorch/pytorch/pull/100522, which makes the `.pytest_cache` available to use here.
A concrete example:
1. Person A pushes a new commit and creates a PR.
2. 2 hours later, test_im_now_broken.py fails
3. Person A attempts to fix the test, but the test is actually still broken
4. The CI, seeing that test_im_now_broken.py had failed on a previous run, will now prioritize running that test first. Instead of waiting another 2 hours to get a signal, Person A only needs to wait ~15 minutes (which is how long it takes for tests to start running)
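A sketch of the reordering this enables (illustrative; the actual CI code reads the failure information from the `.pytest_cache`):
```python
def prioritize(test_files, previously_failed):
    """Run test files that failed on a previous iteration of the PR first."""
    failed_first = [t for t in test_files if t in previously_failed]
    the_rest = [t for t in test_files if t not in previously_failed]
    return failed_first + the_rest

# prioritize(["test_a.py", "test_im_now_broken.py"], {"test_im_now_broken.py"})
# -> ["test_im_now_broken.py", "test_a.py"]
```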
# Testing
I modified a file to make the tests invoking it fail and triggered CI twice with this failure.
First run: https://github.com/pytorch/pytorch/actions/runs/4963943209/jobs/8883800811
Test step took 1h 9m to run
Second run: https://github.com/pytorch/pytorch/actions/runs/4965016776/jobs/8885657992
Test step failed within 2m 27s
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101123
Approved by: https://github.com/malfet, https://github.com/huydhn
- Use a context manager rather than an explicit `try: finally:`
- Add the `refs/remotes` prefix to `onto_branch` in `main` rather than in the
`rebase_onto` functions
- Define `MAIN_BRANCH` and `VIABLE_STRICT_BRANCH` constants in tests.
- Replace `self.assertTrue(x in y)` with `self.assertIn(x, y)`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101503
Approved by: https://github.com/ZainRizvi, https://github.com/huydhn
Also not sure if this should be a public function or not. Leaving it private for now but let me know if you prefer for it to be public.
FYI @nikitaved this will logically conflict with your triton kernel PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101420
Approved by: https://github.com/malfet
This PR is an implementation of the feature request https://github.com/pytorch/pytorch/issues/97888: it adds `torch.dtype.to_complex()` and `torch.dtype.to_float()` methods that convert between float and complex dtypes of the same precision.
Disclaimer: it's the first time I code in C++, so hopefully the code is correct, but I'm not super confident about the PR. Any advice/comment is welcome. It's also my first contribution to a large library, so hopefully I'm not doing anything wrong!
@ezyang
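A usage sketch based on the description above (the method names follow this PR's text, and the expected results are assumptions, not verified against the final API):
```python
import torch

# Convert a real dtype to the complex dtype of the same precision, and back.
assert torch.float32.to_complex() == torch.complex64   # assumed mapping
assert torch.complex128.to_float() == torch.float64    # assumed mapping
```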
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97935
Approved by: https://github.com/ezyang
Many ops take as inputs scalars or scalar lists which are important to understand the properties of the op. For example, convolution ops' behavior and output shapes often depend on padding and strides, which are provided as scalars or lists of scalars. This will record scalar lists when record_inputs=True.
Details:
During collection (and this was true before this PR as well), we serialize values and tensor metadata into an InputOutputEncoder. After collection occurs, we deserialize these values to attach the information to each of the events.
This PR does this:
- Adds support for serializing scalar lists during collection / serialization
- Adds an extra field called "Concrete Args"
- Splits up the deserialization process into two steps - one for generating "input shapes" and one for generating "concrete args". We split up input shapes and concrete args to avoid interrupting any previous workflows that relied on the specific data in the input shapes category; additionally, it's just a better description. Note that single scalars will remain in the "input shapes" category as they were already in that category in the past.
Differential Revision: [D45798431](https://our.internmc.facebook.com/intern/diff/D45798431)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100593
Approved by: https://github.com/aaronenyeshi
Adds retries to external contribution upload as it is shown to be flaky
### <samp>🤖 Generated by Copilot at 43c2602</samp>
Added a function to read data from S3 objects and used it to implement a retry mechanism and verification for uploading external contribution stats. Modified `tools/stats/upload_external_contrib_stats.py` and `tools/stats/upload_stats_lib.py`.
### <samp>🤖 Generated by Copilot at 43c2602</samp>
> _We'll upload the stats to the cloud, me hearties_
> _We'll use `read_from_s3` to check them all_
> _We'll retry if the connection fails, me hearties_
> _We'll log the results and have a ball_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100889
Approved by: https://github.com/huydhn
Introduce `Analysis` to analyze an fx GraphModule and emit diagnostics. This class
can be extended to interact with `Transform` (passes) to decide whether a pass should
trigger based on the graph analysis result, e.g., whether decomp needs to run, by checking
the operator namespaces in the nodes. For now this is left out of scope, but we can revisit
it if maintaining multiple fx extractors becomes a reality.
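A minimal sketch of the kind of analysis described above (names are illustrative, not the actual `Analysis` API introduced here):
```python
import torch.fx

def collect_op_namespaces(gm: torch.fx.GraphModule) -> set:
    """Report which operator namespaces appear in a graph, so a pass such as
    decomposition can decide whether it needs to run."""
    namespaces = set()
    for node in gm.graph.nodes:
        if node.op == "call_function" and hasattr(node.target, "namespace"):
            namespaces.add(node.target.namespace)
    return namespaces
```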
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100451
Approved by: https://github.com/titaiwangms
Summary: This commit adds support for conv + BN fusion for the
case where conv has no bias. Since the replacement patterns with
and without conv bias are substantially different, we perform the
replacement for each of these two cases separately.
Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_prepare_qat_conv_bn_fusion_no_conv_bias
Reviewers: jerryzh168, kimishpatel
Differential Revision: [D45743510](https://our.internmc.facebook.com/intern/diff/D45743510)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100610
Approved by: https://github.com/jerryzh168
Gloo PG used to create a random sequence number and broadcast it to
the rest of the group. But when we started enforcing sequence number checks in
ProcessGroupWrapper, we observed this was occasionally flaky. For example, this
error in a job was wrong, as all ranks were running the first broadcast
collective. Somehow the sequence number wasn't communicated across the store
correctly:
```
RuntimeError: Detected mismatch between collectives on ranks. Rank 16 is running collective: CollectiveFingerPrint(SequenceNumber=1977865401, OpType=BROADCAST, TensorShape=[1], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cpu, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))), but Rank 1 is running collective: CollectiveFingerPrint(SequenceNumber=54090078, OpType=BROADCAST, TensorShape=[1], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cpu, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))).Collectives differ in the following aspects: Sequence number: 1977865401vs 54090078
```
The issue reproduces rarely in tests, but is more common in large world size
jobs.
Differential Revision: [D45870688](https://our.internmc.facebook.com/intern/diff/D45870688/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101422
Approved by: https://github.com/H-Huang
This fixes compiling on systems where `size_t` is an `unsigned int` instead of an `unsigned long int` (32-bit Raspberry Pi OS is one example).
`%ld` expects a `long int`, while `%zu` is the correct format specifier for a `size_t`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101412
Approved by: https://github.com/albanD
This PR fixes the `torch.distributions.wishart.Wishart` example.
Running the current example
```python
m = Wishart(torch.eye(2), torch.Tensor([2]))
m.sample() # Wishart distributed with mean=`df * I` and
# variance(x_ij)=`df` for i != j and variance(x_ij)=`2 * df` for i == j
```
fails with
```
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Untitled-1 in
[321](untitled:Untitled-1?line=320) # %%
----> [322](untitled:Untitled-1?line=321) m = Wishart(torch.eye(2), torch.Tensor([2]))
[323](untitled:Untitled-1?line=322) m.sample() # Wishart distributed with mean=`df * I` and
[324](untitled:Untitled-1?line=323) # variance(x_ij)=`df` for i != j and variance(x_ij)=`2 * df` for i == j
Untitled-1 in __init__(self, df, covariance_matrix, precision_matrix, scale_tril, validate_args)
[83](untitled:Untitled-1?line=82)
[84](untitled:Untitled-1?line=83) if param.dim() < 2:
---> [85](untitled:Untitled-1?line=84) raise ValueError("scale_tril must be at least two-dimensional, with optional leading batch dimensions")
[86](untitled:Untitled-1?line=85)
[87](untitled:Untitled-1?line=86) if isinstance(df, Number):
ValueError: scale_tril must be at least two-dimensional, with optional leading batch dimensions
```
It seems that the parameters of `Wishart.__init__()` were re-ordered, but the documentation was not updated.
This PR fixes it. Here is the updated behaviour:
```python
m = Wishart(torch.Tensor([2]), covariance_matrix=torch.eye(2))
m.sample()
```
```
Untitled-1:255: UserWarning: Singular sample detected.
tensor([[[6.6366, 0.7796],
[0.7796, 0.2136]]])
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95816
Approved by: https://github.com/ngimel, https://github.com/kit1980
Fixes #ISSUE_NUMBER
1. Fix lintrunner in `test/inductor/test_cuda_repro.py`
2. In LibTorch, if we rename the `privateuseone` backend to `foo` and print a tensor with `std::cout << tensor`, we get output like this:
```
1.0, 2.0 ...
[PrivateUse1FloatType{2,3}]
```
and it should be like this
```
1.0, 2.0 ...
[fooFloatType{2,3}]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100797
Approved by: https://github.com/ezyang
PR to enable default workflow PyTorch 2.0 unit tests for the ROCm stack.
- Enables all the dynamo unit test suites
- Enables some of the inductor unit test suites
- `test_config`
- `test_cpp_wrapper` (cpu only)
- `test_minifier`
- `test_standalone_compile`
- `test_torchinductor_dynamic_shapes`
- `test_torchinductor_opinfo`
- `test_torchinductor`
- `test_triton_wrapper`
- Introduces TEST_WITH_ROCM conditions for unit test skip/fail dictionaries in test_torchinductor_dynamic_shapes.py and test_torchinductor_opinfo.py
Note this PR follows on from the discussions for the previous UT enablement PR https://github.com/pytorch/pytorch/pull/97988, we have opted to only enable a few inductor suites at the moment to ease the upstreaming effort as these files are changing very quickly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100981
Approved by: https://github.com/jithunnair-amd, https://github.com/malfet
When we need to link extra libs, we should note that 64-bit CUDA may be installed in "lib", not in "lib64".
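A sketch of the directory check (illustrative; `cuda_home` stands for the detected CUDA install root, and this mirrors the idea rather than `cpp_extension`'s exact code):
```python
import os

def cuda_lib_dir(cuda_home: str) -> str:
    """Prefer <cuda_home>/lib64 when it exists, otherwise fall back to <cuda_home>/lib."""
    lib64 = os.path.join(cuda_home, "lib64")
    return lib64 if os.path.isdir(lib64) else os.path.join(cuda_home, "lib")
```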
### <samp>🤖 Generated by Copilot at 05c1ca6</samp>
Improve CUDA compatibility in `torch.utils.cpp_extension` by checking for `lib64` or `lib` directory.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101285
Approved by: https://github.com/ezyang, https://github.com/malfet
It's easier for users to implement one Override that takes care of
all target submodules of different types, instead of specifying one
mapping pair for each FQN/type. For example, when calculating
sharding for sparse layers, the decision needs to be made globally.
In this case, it's helpful to allow a user Override to get access to
all submodules and make replacement decisions accordingly.
Differential Revision: [D45879732](https://our.internmc.facebook.com/intern/diff/D45879732)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101427
Approved by: https://github.com/fegin
Fixes #ISSUE_NUMBER
Add the serialization logic for backend metadata to the serialization of tensors, implemented through custom registration functions.
In #97429, the backendMeta structure was added to TensorImpl, and we think this part of the information may also need to be serialized for custom backends.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99808
Approved by: https://github.com/ezyang
Fixes #ISSUE_NUMBER
Add a PrivateUse1TestBase in torch/testing/_internal/common_device_type.py to support the custom device extension "privateuse1", and add a "device_type" parameter to the instantiate_device_type_tests function for adding a custom device test base; the default value is None.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99960
Approved by: https://github.com/albanD, https://github.com/malfet
The main addition in this PR is two new API's in AOTAutograd.
**APIs**
`aot_export_module`: Given a module, exports it into a functionalized FX graph. Returns an `fx.GraphModule`, `GraphSignature` pair. The `GraphSignature` tells you various information about the graph, such as which graph inputs correspond to module params/buffers (and their fqn's), how to pytree-ify the inputs and the outputs of the graph. If you specify `trace_joint=True`, then you'll get back a joint forward-backward graph, that also returns parameter gradients in addition to the user outputs.
There are several restrictions on this API, detailed in the comments. The most notable one is probably that this API does not handle partial graphs: if you want a backward graph, then your module's forward function is **required** to return a scalar loss that we can backprop through. It also does not support capturing the optimizer step.
I (gratefully) used @SherlockNoMad and @suo's internal version of the `GraphSignature` object for this API, with a few minor changes in order to integrate it into AOTAutograd.
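A usage sketch for `aot_export_module` (the import path and exact signature are assumptions based on the description above, not a definitive reference):
```python
import torch
from torch._functorch.aot_autograd import aot_export_module  # assumed import path

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 4)

    def forward(self, x):
        return self.linear(x)

# Export a functionalized inference graph plus its signature.
graph_module, graph_signature = aot_export_module(M(), [torch.randn(2, 4)], trace_joint=False)
print(graph_module.graph)   # functionalized forward graph
print(graph_signature)      # maps graph inputs to params/buffers, per the description above
```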
`aot_export_joint_simple`: Given a function, we'll trace it into a joint forward-backward graph and return it. Unlike the above API, the function is **not** required to return a scalar loss. However, this API makes the guarantee that you **do not** need to make any calling convention changes between the original function and the exported one, provided that you do the following:
* If you pass `trace_joint=False`, no work is needed: we'll export a functionalized forward graph with the same set of inputs as the original function
* If you pass `trace_joint=True`, then you will need to manually use the `default_partitioner` or `min_cut_partitioner` from functorch. If you do, and get back a fw and bw graph, then the forward graph will be runnable identically to the original user function.
The main use case for this API is higher order ops: a higher order op like `torch.cond()` can implement its derivative formula by using this API to export a joint graph (for both the true subgraph and the false subgraph), partition it into a fw/bw graph, and run cond on the `true_bw`, `false_bw` subgraphs. cc @zou3519 @Chillee
**Implementation Strategy**
A lot of the work in this PR went in to trying to find a reasonable way to re-use existing AOTAutograd components to expose these API's. Concretely:
* The two new API's are both thin wrappers around `_aot_export_function`: this is a general purpose export API, that just re-uses `create_aot_dispatcher_function`. If we want to add e.g. an export API that includes the optimizer step in the future, we could probably implement it using `_aot_export_function`.
* `aot_export_module` works extra hard to re-use as much of AOTAutograd as possible. For example, when tracing an inference graph, I perform the export under `torch.no_grad()` to make sure we don't accidentally trace out a backwards graph. When exporting a joint graph, I manually `.detach()` all user outputs except the loss, to make sure that we don't accidentally compute gradients for any other user outputs (even if the user forgot to manually detach them).
* A large portion of `aot_export_module` comes from parsing out and creating a `GraphSignature` object. We discussed a few weeks ago that there's potentially a lot more information that we could stuff into this object (see [doc](https://docs.google.com/document/d/1_qzdKew5D1J2Q2GkZ1v5jsczSsIU-Sr0AJiPW7DdGjE/edit?usp=sharing)). For now, I ended up deciding to support the more limited use case of exporting a fwd-bwd full graph, without some of the extra annotations in that doc (for example, if we were to export partial graphs, we would need annotations for saved activations). My thought is that once a more concrete use case comes up that the existing API doesn't satisfy, we can revisit the annotations then.
* I factored out `create_functional_call()` and `create_tree_flattened_fn()` for pytree-flattening and lifting-params-and-buffers, since I also need them in the export code
* I added an `AOTConfig.is_export` flag. The export API re-uses all of the same code paths as the rest of AOTAutograd, but there are a few points where we need to either exit early (and avoid making a runtime epilogue), or add extra error checking, that is only valuable for export.
* `aot_dispatch_autograd()` now exits early if it's being called in an export context, so it returns the full graph instead of also trying to create an `autograd.Function`. I think we probably want to factor this out, although I figured it would be safer to wait a bit for clarity on how functional RNG works with export.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100587
Approved by: https://github.com/ezyang, https://github.com/SherlockNoMad
## Description
This is a bug fix for rare cases that can happen with a specific scale and antialias=False, where the output for a random line can be wrong. For example:
```
line 14
output uint8: [76, 78, 80, 81, 83, 85, 87, 88, 90]
expected float: [149, 152, 155, 158, 161, 164, 167, 170, 173]
diff: [-73, -74, -75, -77, -78, -79, -80, -82, -83]
opencv ref: [149 152 155 158 161 164 167 170 173]
```
It appears that for this line we have 3 weight coefficients instead of 2:
```
line 13 | 351, 2
k: 1130 15254
line 14 | 378, 3
k: 0 16384 -6780 <------- We should have 2 weights and not 3
line 15 | 432, 2
k: 15254 1130
```
which comes from our `_compute_weights_aa` function that is specifically used for AA=False and uint8.
```
xmin = std::max(
static_cast<int64_t>(center - support + 0.5 + align_corners_delta), static_cast<int64_t>(0));
xsize = std::min(
static_cast<int64_t>(center + support + 0.5 + align_corners_delta), input_size) - xmin;
```
```
center - support + 0.5 + align_corners_delta: 14.999999999999998
static_cast<int64_t>(center - support + 0.5 + align_corners_delta): 14
xmin -> 14
center + support + 0.5 + align_corners_delta: 17.0
static_cast<int64_t>(center + support + 0.5 + align_corners_delta): 17.0
xsize -> 17 - 14 = 3 <------ 3 instead of 2
```
For the float dtype, the AA=False weights and indices are computed differently, for historical reasons (that path was implemented first).
In any case, `xsize` should not be larger than `max_interp_size`, so we decided to clip `xsize`.
Once fixed, the computed indices and weights are the same as for the float dtype code path:
```
# Option: xsize = min(xsize, max_interp_size)
Line Num | xmin, xsize
14 | 378, 2 xmin=378 <---> xmin = i * stride = i * 3 * 9 => i = 14
k: 0 16384 16384 = w * (1 << 14) => w = 1.0
=> i=14, w=0 and i=15, w=1
```
vs
```
Line Num | index0, index1
F32: 14 | 15, 16
F32: lambda0, lambda1: 0.999999, 9.53674e-07
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101403
Approved by: https://github.com/NicolasHug
PR #95568 enables more NVCC warnings. However, some .cu files need to be modified to make the build process warning-free. PR #100823 already contains some fixes. This PR aims to fix the remaining ones without breaking the codebase.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101383
Approved by: https://github.com/zou3519
arguments() returns a vector member of the object returned by the schema() call.
When the object returned by the schema() call is destroyed, the vector is deallocated as well;
its lifetime isn't extended.
This issue was detected while running `pytest -v test/mobile/test_lite_script_type.py -k test_nest_typing_namedtuple_custom_classtype` with ASAN.
<details>
<summary>ASAN output</summary>
```
==1134126==ERROR: AddressSanitizer: heap-use-after-free on address 0x60d0005a5790 at pc 0x03ff844488d8 bp 0x03fff584afe8 sp 0x03fff584afd8
READ of size 8 at 0x60d0005a5790 thread T0
#0 0x3ff844488d7 in __gnu_cxx::__normal_iterator<c10::Argument const*, std::vector<c10::Argument, std::allocator<c10::Argument> > >::__normal_iterator(c10::Argument const* const&) /usr/lib/gcc/s390x-i
bm-linux-gnu/11/include/g++-v11/bits/stl_iterator.h:1028
#1 0x3ff8444293f in std::vector<c10::Argument, std::allocator<c10::Argument> >::begin() const /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/stl_vector.h:821
#2 0x3ff84d807d1 in torch::jit::toPyObject(c10::IValue) /home/user/pytorch/torch/csrc/jit/python/pybind_utils.cpp:617
#3 0x3ff84d80305 in torch::jit::toPyObject(c10::IValue) /home/user/pytorch/torch/csrc/jit/python/pybind_utils.cpp:604
#4 0x3ff84856871 in pybind11::detail::type_caster<c10::IValue, void>::cast(c10::IValue, pybind11::return_value_policy, pybind11::handle) /home/user/pytorch/torch/csrc/jit/python/pybind.h:138
#5 0x3ff85318191 in pybind11::cpp_function::initialize<torch::jit::initJitScriptBindings(_object*)::$_45, c10::IValue, torch::jit::mobile::Module&, pybind11::tuple const&, pybind11::name, pybind11::is
_method, pybind11::sibling, pybind11::arg>(torch::jit::initJitScriptBindings(_object*)::$_45&&, c10::IValue (*)(torch::jit::mobile::Module&, pybind11::tuple const&), pybind11::name const&, pybind11::is_me
thod const&, pybind11::sibling const&, pybind11::arg const&)::{lambda(pybind11::detail::function_call&)#1}::operator()(pybind11::detail::function_call&) const /home/user/pytorch/cmake/../third_party/pybin
d11/include/pybind11/pybind11.h:249
#6 0x3ff85317cfd in pybind11::cpp_function::initialize<torch::jit::initJitScriptBindings(_object*)::$_45, c10::IValue, torch::jit::mobile::Module&, pybind11::tuple const&, pybind11::name, pybind11::is
_method, pybind11::sibling, pybind11::arg>(torch::jit::initJitScriptBindings(_object*)::$_45&&, c10::IValue (*)(torch::jit::mobile::Module&, pybind11::tuple const&), pybind11::name const&, pybind11::is_me
thod const&, pybind11::sibling const&, pybind11::arg const&)::{lambda(pybind11::detail::function_call&)#1}::__invoke(pybind11::detail::function_call&) /home/user/pytorch/cmake/../third_party/pybind11/incl
ude/pybind11/pybind11.h:224
#7 0x3ff82ee52e9 in pybind11::cpp_function::dispatcher(_object*, _object*, _object*) /home/user/pytorch/cmake/../third_party/pybind11/include/pybind11/pybind11.h:929
#8 0x3ffab002903 in cfunction_call Objects/methodobject.c:543
#9 0x3ffaaf8a933 in _PyObject_MakeTpCall Objects/call.c:215
#10 0x3ffaaf8e919 in _PyObject_VectorcallTstate Include/cpython/abstract.h:112
#11 0x3ffaaf8eddd in method_vectorcall Objects/classobject.c:53
#12 0x3ffab0f00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
#13 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
#14 0x3ffab105447 in call_function Python/ceval.c:5891
#15 0x3ffab0ff779 in _PyEval_EvalFrameDefault Python/ceval.c:4181
#16 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
#17 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
#18 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
#19 0x3ffaaf8a615 in _PyObject_FastCallDictTstate Objects/call.c:142
#20 0x3ffaaf8b271 in _PyObject_Call_Prepend Objects/call.c:431
#21 0x3ffab03f307 in slot_tp_call Objects/typeobject.c:7494
#22 0x3ffaaf8a933 in _PyObject_MakeTpCall Objects/call.c:215
#23 0x3ffab0f0081 in _PyObject_VectorcallTstate Include/cpython/abstract.h:112
#24 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
#25 0x3ffab105447 in call_function Python/ceval.c:5891
#26 0x3ffab0ff905 in _PyEval_EvalFrameDefault Python/ceval.c:4213
#27 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
#28 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
#29 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
#30 0x3ffaaf8e941 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
#31 0x3ffaaf8eddd in method_vectorcall Objects/classobject.c:53
#32 0x3ffab0f00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
#33 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
#34 0x3ffab105447 in call_function Python/ceval.c:5891
#35 0x3ffab0ff905 in _PyEval_EvalFrameDefault Python/ceval.c:4213
#36 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
#37 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
#38 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
#39 0x3ffab0f00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
#40 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
#41 0x3ffab105447 in call_function Python/ceval.c:5891
#42 0x3ffab0ff7d7 in _PyEval_EvalFrameDefault Python/ceval.c:4198
#43 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
#44 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
#45 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
#46 0x3ffaaf8e941 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
#47 0x3ffaaf8eddd in method_vectorcall Objects/classobject.c:53
#48 0x3ffab0f00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
#49 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
#50 0x3ffab105447 in call_function Python/ceval.c:5891
#51 0x3ffab0ffa57 in _PyEval_EvalFrameDefault Python/ceval.c:4231
#52 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
#53 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
#54 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
#55 0x3ffaaf8e941 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
#56 0x3ffaaf8eddd in method_vectorcall Objects/classobject.c:53
#57 0x3ffab0f00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
#58 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
#59 0x3ffab105447 in call_function Python/ceval.c:5891
#60 0x3ffab0ffa57 in _PyEval_EvalFrameDefault Python/ceval.c:4231
#61 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
#62 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
#63 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
#64 0x3ffaaf8e941 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
#65 0x3ffaaf8eddd in method_vectorcall Objects/classobject.c:53
#66 0x3ffaaf8ab9b in PyVectorcall_Call Objects/call.c:267
#67 0x3ffaaf8ac65 in _PyObject_Call Objects/call.c:290
#68 0x3ffaaf8ada9 in PyObject_Call Objects/call.c:317
#69 0x3ffab1059c7 in do_call_core Python/ceval.c:5943
#70 0x3ffab0ffd39 in _PyEval_EvalFrameDefault Python/ceval.c:4277
#71 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
#72 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
#73 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
#74 0x3ffaaf8a695 in _PyObject_FastCallDictTstate Objects/call.c:153
#75 0x3ffaaf8b271 in _PyObject_Call_Prepend Objects/call.c:431
#76 0x3ffab03f307 in slot_tp_call Objects/typeobject.c:7494
#77 0x3ffaaf8a933 in _PyObject_MakeTpCall Objects/call.c:215
#78 0x3ffab0f0081 in _PyObject_VectorcallTstate Include/cpython/abstract.h:112
#79 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
#80 0x3ffab105447 in call_function Python/ceval.c:5891
#81 0x3ffab0ffa57 in _PyEval_EvalFrameDefault Python/ceval.c:4231
#82 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
#83 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
#84 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
#85 0x3ffab0f00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
#86 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
#87 0x3ffab105447 in call_function Python/ceval.c:5891
#88 0x3ffab0ff7d7 in _PyEval_EvalFrameDefault Python/ceval.c:4198
#89 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
#90 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
#91 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
#92 0x3ffaaf8ab15 in PyVectorcall_Call Objects/call.c:255
#93 0x3ffaaf8ac65 in _PyObject_Call Objects/call.c:290
#94 0x3ffaaf8ada9 in PyObject_Call Objects/call.c:317
#95 0x3ffab1059c7 in do_call_core Python/ceval.c:5943
#96 0x3ffab0ffd39 in _PyEval_EvalFrameDefault Python/ceval.c:4277
#97 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
#98 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
#99 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
#100 0x3ffab0f00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
#101 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
#102 0x3ffab105447 in call_function Python/ceval.c:5891
#103 0x3ffab0ff779 in _PyEval_EvalFrameDefault Python/ceval.c:4181
#104 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
#105 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
#106 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
#107 0x3ffaaf8e941 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
#108 0x3ffaaf8eddd in method_vectorcall Objects/classobject.c:53
#109 0x3ffab0f00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
#110 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
#111 0x3ffab105447 in call_function Python/ceval.c:5891
#112 0x3ffab0ff779 in _PyEval_EvalFrameDefault Python/ceval.c:4181
#113 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
#114 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
#115 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
#116 0x3ffaaf8a695 in _PyObject_FastCallDictTstate Objects/call.c:153
#117 0x3ffaaf8b271 in _PyObject_Call_Prepend Objects/call.c:431
#118 0x3ffab03f307 in slot_tp_call Objects/typeobject.c:7494
#119 0x3ffaaf8ad17 in _PyObject_Call Objects/call.c:305
#120 0x3ffaaf8ada9 in PyObject_Call Objects/call.c:317
#121 0x3ffab1059c7 in do_call_core Python/ceval.c:5943
#122 0x3ffab0ffd39 in _PyEval_EvalFrameDefault Python/ceval.c:4277
#123 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
#124 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
#125 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
#126 0x3ffab0f00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
#127 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
#128 0x3ffab105447 in call_function Python/ceval.c:5891
#129 0x3ffab0ff905 in _PyEval_EvalFrameDefault Python/ceval.c:4213
#130 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
#131 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
#132 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
#133 0x3ffaaf8e941 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
#134 0x3ffaaf8eddd in method_vectorcall Objects/classobject.c:53
#135 0x3ffab0f00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
#136 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
#137 0x3ffab105447 in call_function Python/ceval.c:5891
#138 0x3ffab0ffa57 in _PyEval_EvalFrameDefault Python/ceval.c:4231
#139 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
#140 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
#141 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
#142 0x3ffaaf8ab15 in PyVectorcall_Call Objects/call.c:255
#143 0x3ffaaf8ac65 in _PyObject_Call Objects/call.c:290
#144 0x3ffaaf8ada9 in PyObject_Call Objects/call.c:317
#145 0x3ffab1059c7 in do_call_core Python/ceval.c:5943
#146 0x3ffab0ffd39 in _PyEval_EvalFrameDefault Python/ceval.c:4277
#147 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
#148 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
#149 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
#150 0x3ffab0f00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
#151 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
#152 0x3ffab105447 in call_function Python/ceval.c:5891
#153 0x3ffab0ff905 in _PyEval_EvalFrameDefault Python/ceval.c:4213
#154 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
#155 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
#156 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
#157 0x3ffab0f00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
#158 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
#159 0x3ffab105447 in call_function Python/ceval.c:5891
#160 0x3ffab0ffa57 in _PyEval_EvalFrameDefault Python/ceval.c:4231
#161 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
#162 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
#163 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
#164 0x3ffaaf8ab15 in PyVectorcall_Call Objects/call.c:255
#165 0x3ffaaf8ac65 in _PyObject_Call Objects/call.c:290
#166 0x3ffaaf8ada9 in PyObject_Call Objects/call.c:317
#167 0x3ffab1059c7 in do_call_core Python/ceval.c:5943
#168 0x3ffab0ffd39 in _PyEval_EvalFrameDefault Python/ceval.c:4277
#169 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
#170 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
#171 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
#172 0x3ffab0f00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
#173 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
#174 0x3ffab105447 in call_function Python/ceval.c:5891
#175 0x3ffab0ff779 in _PyEval_EvalFrameDefault Python/ceval.c:4181
#176 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
#177 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
#178 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
#179 0x3ffaaf8e941 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
#180 0x3ffaaf8eddd in method_vectorcall Objects/classobject.c:53
#181 0x3ffab0f00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
#182 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
#183 0x3ffab105447 in call_function Python/ceval.c:5891
#184 0x3ffab0ff779 in _PyEval_EvalFrameDefault Python/ceval.c:4181
#185 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
#186 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
#187 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
#188 0x3ffaaf8a695 in _PyObject_FastCallDictTstate Objects/call.c:153
#189 0x3ffaaf8b271 in _PyObject_Call_Prepend Objects/call.c:431
#190 0x3ffab03f307 in slot_tp_call Objects/typeobject.c:7494
#191 0x3ffaaf8a933 in _PyObject_MakeTpCall Objects/call.c:215
#192 0x3ffab0f0081 in _PyObject_VectorcallTstate Include/cpython/abstract.h:112
#193 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
#194 0x3ffab105447 in call_function Python/ceval.c:5891
#195 0x3ffab0ffa57 in _PyEval_EvalFrameDefault Python/ceval.c:4231
#196 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
#197 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
#198 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
#199 0x3ffaaf8ab15 in PyVectorcall_Call Objects/call.c:255
#200 0x3ffaaf8ac65 in _PyObject_Call Objects/call.c:290
#201 0x3ffaaf8ada9 in PyObject_Call Objects/call.c:317
#202 0x3ffab1059c7 in do_call_core Python/ceval.c:5943
#203 0x3ffab0ffd39 in _PyEval_EvalFrameDefault Python/ceval.c:4277
#204 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
#205 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
#206 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
#207 0x3ffab0f00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
#208 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
#209 0x3ffab105447 in call_function Python/ceval.c:5891
#210 0x3ffab0ff779 in _PyEval_EvalFrameDefault Python/ceval.c:4181
#211 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
#212 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
#213 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
#214 0x3ffaaf8e941 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
#215 0x3ffaaf8eddd in method_vectorcall Objects/classobject.c:53
#216 0x3ffab0f00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
#217 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
#218 0x3ffab105447 in call_function Python/ceval.c:5891
#219 0x3ffab0ff779 in _PyEval_EvalFrameDefault Python/ceval.c:4181
#220 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
#221 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
#222 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
#223 0x3ffaaf8a695 in _PyObject_FastCallDictTstate Objects/call.c:153
#224 0x3ffaaf8b271 in _PyObject_Call_Prepend Objects/call.c:431
#225 0x3ffab03f307 in slot_tp_call Objects/typeobject.c:7494
#226 0x3ffaaf8a933 in _PyObject_MakeTpCall Objects/call.c:215
#227 0x3ffab0f0081 in _PyObject_VectorcallTstate Include/cpython/abstract.h:112
#228 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
#229 0x3ffab105447 in call_function Python/ceval.c:5891
#230 0x3ffab0ffa57 in _PyEval_EvalFrameDefault Python/ceval.c:4231
#231 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
#232 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
#233 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
#234 0x3ffab0f00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
#235 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
#236 0x3ffab105447 in call_function Python/ceval.c:5891
#237 0x3ffab0ff905 in _PyEval_EvalFrameDefault Python/ceval.c:4213
#238 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
#239 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
#240 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
#241 0x3ffab0f00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
#242 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
#243 0x3ffab105447 in call_function Python/ceval.c:5891
#244 0x3ffab0ff905 in _PyEval_EvalFrameDefault Python/ceval.c:4213
#245 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
#246 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
#247 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
#248 0x3ffaaf8ab15 in PyVectorcall_Call Objects/call.c:255
#249 0x3ffaaf8ac65 in _PyObject_Call Objects/call.c:290
0x60d0005a5790 is located 80 bytes inside of 136-byte region [0x60d0005a5740,0x60d0005a57c8)
freed by thread T0 here:
#0 0x3ffab537de5 in operator delete(void*) /var/tmp/portage/sys-devel/gcc-11.3.1_p20230303/work/gcc-11-20230303/libsanitizer/asan/asan_new_delete.cpp:160
#1 0x3ff55984fdb in __gnu_cxx::new_allocator<std::_Sp_counted_ptr_inplace<c10::FunctionSchema, std::allocator<c10::FunctionSchema>, (__gnu_cxx::_Lock_policy)2> >::deallocate(std::_Sp_counted_ptr_inplace<c10::FunctionSchema, std::allocator<c10::FunctionSchema>, (__gnu_cxx::_Lock_policy)2>*, unsigned long) /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/ext/new_allocator.h:145
previously allocated by thread T0 here:
#0 0x3ffab53734f in operator new(unsigned long) /var/tmp/portage/sys-devel/gcc-11.3.1_p20230303/work/gcc-11-20230303/libsanitizer/asan/asan_new_delete.cpp:99
#1 0x3ff5598443f in __gnu_cxx::new_allocator<std::_Sp_counted_ptr_inplace<c10::FunctionSchema, std::allocator<c10::FunctionSchema>, (__gnu_cxx::_Lock_policy)2> >::allocate(unsigned long, void const*) /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/ext/new_allocator.h:127
#2 0x3fff5849ecf ([stack]+0xb2ecf)
SUMMARY: AddressSanitizer: heap-use-after-free /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/stl_iterator.h:1028 in __gnu_cxx::__normal_iterator<c10::Argument const*, std::vector<c10::Argument, std::allocator<c10::Argument> > >::__normal_iterator(c10::Argument const* const&)
Shadow bytes around the buggy address:
0x100c1a000b4aa0: fd fd fd fd fd fd fd fd fd fd fd fa fa fa fa fa
0x100c1a000b4ab0: fa fa fa fa fd fd fd fd fd fd fd fd fd fd fd fd
0x100c1a000b4ac0: fd fd fd fd fd fa fa fa fa fa fa fa fa fa fd fd
0x100c1a000b4ad0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fa
0x100c1a000b4ae0: fa fa fa fa fa fa fa fa fd fd fd fd fd fd fd fd
=>0x100c1a000b4af0: fd fd[fd]fd fd fd fd fd fd fa fa fa fa fa fa fa
0x100c1a000b4b00: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x100c1a000b4b10: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x100c1a000b4b20: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x100c1a000b4b30: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x100c1a000b4b40: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone: fa
Freed heap region: fd
Stack left redzone: f1
Stack mid redzone: f2
Stack right redzone: f3
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Container overflow: fc
Array cookie: ac
Intra object redzone: bb
ASan internal: fe
Left alloca redzone: ca
Right alloca redzone: cb
Shadow gap: cc
==1134126==ABORTING
```
Additional backtraces (not full):
Allocation:
```
#0 __memset_z196 () at ../sysdeps/s390/memset-z900.S:144
#1 0x000003ff96f3072a in __asan::Allocator::Allocate (this=this@entry=0x3ff97041eb8 <__asan::instance>, size=size@entry=136, alignment=8, alignment@entry=0, stack=<optimized out>,
stack@entry=0x3ffdbb45d78, alloc_type=<optimized out>, can_fill=true) at /var/tmp/portage/sys-devel/gcc-11.3.1_p20230303/work/gcc-11-20230303/libsanitizer/asan/asan_allocator.cpp:599
#2 0x000003ff96f2c088 in __asan::asan_memalign (alignment=alignment@entry=0, size=size@entry=136, stack=stack@entry=0x3ffdbb45d78, alloc_type=alloc_type@entry=__asan::FROM_NEW)
at /var/tmp/portage/sys-devel/gcc-11.3.1_p20230303/work/gcc-11-20230303/libsanitizer/asan/asan_allocator.cpp:1039
#3 0x000003ff96fb73b0 in operator new (size=136) at /var/tmp/portage/sys-devel/gcc-11.3.1_p20230303/work/gcc-11-20230303/libsanitizer/asan/asan_new_delete.cpp:99
#4 0x000003ff41404440 in __gnu_cxx::new_allocator<std::_Sp_counted_ptr_inplace<c10::FunctionSchema, std::allocator<c10::FunctionSchema>, (__gnu_cxx::_Lock_policy)2> >::allocate (this=0x3ffdbb468c0,
__n=1) at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/ext/new_allocator.h:127
#5 0x000003ff414042a0 in std::allocator_traits<std::allocator<std::_Sp_counted_ptr_inplace<c10::FunctionSchema, std::allocator<c10::FunctionSchema>, (__gnu_cxx::_Lock_policy)2> > >::allocate (__a=...,
__n=1) at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/alloc_traits.h:464
#6 0x000003ff41403b66 in std::__allocate_guarded<std::allocator<std::_Sp_counted_ptr_inplace<c10::FunctionSchema, std::allocator<c10::FunctionSchema>, (__gnu_cxx::_Lock_policy)2> > > (__a=...)
at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/allocated_ptr.h:98
#7 0x000003ff4140372a in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<c10::FunctionSchema, std::allocator<c10::FunctionSchema>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::vector<c10::Argument, std::allocator<c10::Argument> >, std::vector<c10::Argument, std::allocator<c10::Argument> > > (this=0x3ffdbb47888, __p=@0x3ffdbb47880: 0x0, __a=..., __args=..., __args=..., __args=..., __args=...)
at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/shared_ptr_base.h:648
#8 0x000003ff41403328 in std::__shared_ptr<c10::FunctionSchema, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<std::allocator<c10::FunctionSchema>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::vector<c10::Argument, std::allocator<c10::Argument> >, std::vector<c10::Argument, std::allocator<c10::Argument> > > (this=0x3ffdbb47880, __tag=..., __args=..., __args=..., __args=..., __args=...) at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/shared_ptr_base.h:1342
#9 0x000003ff41402f06 in std::shared_ptr<c10::FunctionSchema>::shared_ptr<std::allocator<c10::FunctionSchema>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::vector<c10::Argument, std::allocator<c10::Argument> >, std::vector<c10::Argument, std::allocator<c10::Argument> > > (
this=0x3ffdbb47880, __tag=..., __args=..., __args=..., __args=..., __args=...) at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/shared_ptr.h:409
#10 0x000003ff41402b6e in std::allocate_shared<c10::FunctionSchema, std::allocator<c10::FunctionSchema>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::vector<c10::Argument, std::allocator<c10::Argument> >, std::vector<c10::Argument, std::allocator<c10::Argument> > > (__a=...,
__args=..., __args=..., __args=..., __args=...) at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/shared_ptr.h:862
#11 0x000003ff4140215c in std::make_shared<c10::FunctionSchema, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::vector<c10::Argument, std::allocator<c10::Argument> >, std::vector<c10::Argument, std::allocator<c10::Argument> > > (__args=..., __args=..., __args=..., __args=...)
at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/shared_ptr.h:878
#12 0x000003ff413d180c in c10::TupleType::createWithSpec<c10::basic_string_view<char> > (qualName=..., field_names=std::vector of length 1, capacity 1 = {...},
field_types=std::vector of length 1, capacity 1 = {...}, field_defaults=std::vector of length 0, capacity 0) at /home/user/pytorch/aten/src/ATen/core/type.cpp:769
#13 0x000003ff413b9ca6 in c10::TupleType::createNamed (qualName=..., field_names=std::vector of length 1, capacity 1 = {...}, field_types=std::vector of length 1, capacity 1 = {...})
at /home/user/pytorch/aten/src/ATen/core/type.cpp:725
#14 0x000003ff4115fbac in c10::ivalue::TupleTypeFactory<c10::TupleType>::fallback (type=...) at /home/user/pytorch/aten/src/ATen/core/dynamic_type.cpp:383
#15 0x000003ff708217fe in c10::ivalue::Tuple::type<c10::TupleType> (this=0x6080004b8520) at /home/user/pytorch/aten/src/ATen/core/ivalue_inl.h:781
#16 0x000003ff70800740 in torch::jit::toPyObject (ivalue=...) at /home/user/pytorch/torch/csrc/jit/python/pybind_utils.cpp:613
#17 0x000003ff70800306 in torch::jit::toPyObject (ivalue=...) at /home/user/pytorch/torch/csrc/jit/python/pybind_utils.cpp:604
#18 0x000003ff702d6872 in pybind11::detail::type_caster<c10::IValue, void>::cast (src=...) at /home/user/pytorch/torch/csrc/jit/python/pybind.h:138
#19 0x000003ff70d98192 in pybind11::cpp_function::initialize<torch::jit::initJitScriptBindings(_object*)::$_45, c10::IValue, torch::jit::mobile::Module&, pybind11::tuple const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg>(torch::jit::initJitScriptBindings(_object*)::$_45&&, c10::IValue (*)(torch::jit::mobile::Module&, pybind11::tuple const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&)::{lambda(pybind11::detail::function_call&)#1}::operator()(pybind11::detail::function_call&) const (this=0x3ffdbb4ca20, call=...)
at /home/user/pytorch/cmake/../third_party/pybind11/include/pybind11/pybind11.h:249
#20 0x000003ff70d97cfe in pybind11::cpp_function::initialize<torch::jit::initJitScriptBindings(_object*)::$_45, c10::IValue, torch::jit::mobile::Module&, pybind11::tuple const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg>(torch::jit::initJitScriptBindings(_object*)::$_45&&, c10::IValue (*)(torch::jit::mobile::Module&, pybind11::tuple const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&)::{lambda(pybind11::detail::function_call&)#1}::__invoke(pybind11::detail::function_call&) (call=...)
at /home/user/pytorch/cmake/../third_party/pybind11/include/pybind11/pybind11.h:224
#21 0x000003ff6e9652ea in pybind11::cpp_function::dispatcher (self=<PyCapsule at remote 0x3ff83e27720>,
args_in=(<torch._C.LiteScriptModule at remote 0x3ff811844b0>, (<Tensor at remote 0x3ff814efb00>,)), kwargs_in=0x0) at /home/user/pytorch/cmake/../third_party/pybind11/include/pybind11/pybind11.h:929
```
Deallocation:
```
#0 operator delete (ptr=0x60d0005a5740) at /var/tmp/portage/sys-devel/gcc-11.3.1_p20230303/work/gcc-11-20230303/libsanitizer/asan/asan_new_delete.cpp:160
#1 0x000003ff44904fdc in __gnu_cxx::new_allocator<std::_Sp_counted_ptr_inplace<c10::FunctionSchema, std::allocator<c10::FunctionSchema>, (__gnu_cxx::_Lock_policy)2> >::deallocate (this=0x3ffc5dc8020,
__p=0x60d0005a5740, __t=1) at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/ext/new_allocator.h:145
#2 0x000003ff44904fa8 in std::allocator_traits<std::allocator<std::_Sp_counted_ptr_inplace<c10::FunctionSchema, std::allocator<c10::FunctionSchema>, (__gnu_cxx::_Lock_policy)2> > >::deallocate (
__a=..., __p=0x60d0005a5740, __n=1) at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/alloc_traits.h:496
#3 0x000003ff449041f2 in std::__allocated_ptr<std::allocator<std::_Sp_counted_ptr_inplace<c10::FunctionSchema, std::allocator<c10::FunctionSchema>, (__gnu_cxx::_Lock_policy)2> > >::~__allocated_ptr (
this=0x3ffc5dc8030) at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/allocated_ptr.h:74
#4 0x000003ff44904888 in std::_Sp_counted_ptr_inplace<c10::FunctionSchema, std::allocator<c10::FunctionSchema>, (__gnu_cxx::_Lock_policy)2>::_M_destroy (this=0x60d0005a5740)
at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/shared_ptr_base.h:538
#5 0x000003ff43895a62 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x60d0005a5740) at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/shared_ptr_base.h:184
#6 0x000003ff43895420 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x611000c40648) at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/shared_ptr_base.h:705
#7 0x000003ff4466e7f4 in std::__shared_ptr<c10::FunctionSchema, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x611000c40640)
at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/shared_ptr_base.h:1154
#8 0x000003ff4466d820 in std::shared_ptr<c10::FunctionSchema>::~shared_ptr (this=0x611000c40640) at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/shared_ptr.h:122
#9 0x000003ff448d82f6 in c10::TupleType::~TupleType (this=0x611000c40580) at /home/user/pytorch/aten/src/ATen/core/jit_type.h:1142
#10 0x000003ff448d8346 in c10::TupleType::~TupleType (this=0x611000c40580) at /home/user/pytorch/aten/src/ATen/core/jit_type.h:1142
#11 0x000003ff731296a4 in std::_Sp_counted_ptr<c10::TupleType*, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=0x603000c43ae0)
at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/shared_ptr_base.h:348
#12 0x000003ff71eaf666 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x603000c43ae0) at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/shared_ptr_base.h:168
#13 0x000003ff71eaf330 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x3ffc5dc9368) at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/shared_ptr_base.h:705
#14 0x000003ff73129ee4 in std::__shared_ptr<c10::TupleType, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x3ffc5dc9360)
at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/shared_ptr_base.h:1154
#15 0x000003ff73122390 in std::shared_ptr<c10::TupleType>::~shared_ptr (this=0x3ffc5dc9360) at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/shared_ptr.h:122
#16 0x000003ff73d00788 in torch::jit::toPyObject (ivalue=...) at /home/user/pytorch/torch/csrc/jit/python/pybind_utils.cpp:613
#17 0x000003ff73d00306 in torch::jit::toPyObject (ivalue=...) at /home/user/pytorch/torch/csrc/jit/python/pybind_utils.cpp:604
```
</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101400
Approved by: https://github.com/zou3519
When adding guards to the constraint solver, we check that they are consistent, i.e., they do not simplify to false when their free symbols are substituted with the corresponding concrete values.
However this check may "spuriously" fail because it doesn't take into account precision errors when comparing floats. Since the symbols involved are all positive integers, we try to approximate floats in the guards with rationals, providing concrete values as hints: `sympy.nsimplify` does the job.
As an alternative approach, we considered using `sympy.evalf` to compare with reduced precision. But we did not pursue it because
* the choice of what is a good reduced precision feels arbitrary (`sympy` uses `1e15` by default);
* more importantly, there is no guarantee that we will not encounter the same problem when solving downstream.
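A minimal standalone sketch of the rational-approximation idea (plain sympy, not the actual guard-handling code; the symbol name and coefficient are illustrative):
```python
import sympy

s0 = sympy.Symbol("s0", positive=True, integer=True)
# A float coefficient such as 1/3 stored as 0.3333333333333333 can make an
# otherwise-exact comparison fail once concrete values are substituted.
approx = s0 * 0.3333333333333333
exact = sympy.nsimplify(approx, rational=True)  # rewrites the float as 1/3
print(exact)               # s0/3
print(exact.subs(s0, 30))  # 10, an exact integer rather than 9.999...
```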
Differential Revision: [D45826951](https://our.internmc.facebook.com/intern/diff/D45826951/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101307
Approved by: https://github.com/ezyang
Fixes cpp wrapper support for kernels that are not exposed in `torch.ops.aten`. The current PR limits the support scope to `repeat_interleave.Tensor` and will submit follow-up PRs for more OPs.
The PR maps the python schema of the kernel to the cpp schema and uses `c10::Dispatcher::singleton().findSchemaOrThrow` to find the corresponding cpp OP.
The current support is limited and will raise `AssertionError` for unsupported cases.
The limitations include:
- only supports kernels that are not aliases
- only supports kernels whose args and returns don't have `alias_info`
- only supports output args that are a `Tensor`
- only supports input args that are `Tensor`, `Optional[int]`, `Optional[float]`, and `Optional[bool]`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100788
Approved by: https://github.com/jgong5, https://github.com/desertfire
When tensor.size(self.dim) < num_chunks, we fill the missing chunks with empty tensors (https://github.com/pytorch/pytorch/pull/98722). Therefore, we no longer need this assert.
For example, when sharding a tensor with 1 element on 2 ranks along dim 0, results would be as follows:
```
rank:0, dtensor:DTensor(local_tensor=tensor([0.4963], device='cuda:0'), device_mesh=DeviceMesh:([0, 1]), placements=[Shard(dim=0)])
rank:1, dtensor:DTensor(local_tensor=tensor([], device='cuda:1'), device_mesh=DeviceMesh:([0, 1]), placements=[Shard(dim=0)])
```
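A standalone sketch of the padding behavior using plain `torch.chunk` (not the DTensor sharding code itself):
```python
import torch

num_chunks = 2
t = torch.randn(1)                                # tensor.size(dim) < num_chunks
chunks = list(torch.chunk(t, num_chunks, dim=0))  # only one chunk comes back
# fill the missing chunks with empty tensors so every rank still gets a shard
while len(chunks) < num_chunks:
    chunks.append(t.new_empty(0))
print([c.shape for c in chunks])  # [torch.Size([1]), torch.Size([0])]
```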
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101218
Approved by: https://github.com/wanchaol
Summary:
Otherwise we get
```
Traceback (most recent call last):
File "<string>", line 49, in <module>
File "<string>", line 47, in __run
File "/usr/local/fbcode/platform010/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/local/fbcode/platform010/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/data/users/jongsoo/fbsource/buck-out/v2/gen/fbcode/ef4169ac7f95fb74/caffe2/benchmarks/transformer/__sdp_backwards__/sdp_backwards#link-tree/caffe2/benchmarks/transformer/sdp_backwards.py", line 188, in <module>
main()
File "/data/users/jongsoo/fbsource/buck-out/v2/gen/fbcode/ef4169ac7f95fb74/caffe2/benchmarks/transformer/__sdp_backwards__/sdp_backwards#link-tree/caffe2/benchmarks/transformer/sdp_backwards.py", line 184, in main
run_timing(min_run_time, batch_size, embed_dim, num_heads, max_seq_len, dtype)
File "/data/users/jongsoo/fbsource/buck-out/v2/gen/fbcode/ef4169ac7f95fb74/caffe2/benchmarks/transformer/__sdp_backwards__/sdp_backwards#link-tree/caffe2/benchmarks/transformer/sdp_backwards.py", line 105, in run_timing
rand_fused_upward = cpt(x, x, x, mask).clone().detach()
File "/data/users/jongsoo/fbsource/buck-out/v2/gen/fbcode/ef4169ac7f95fb74/caffe2/benchmarks/transformer/__sdp_backwards__/sdp_backwards#link-tree/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/data/users/jongsoo/fbsource/buck-out/v2/gen/fbcode/ef4169ac7f95fb74/caffe2/benchmarks/transformer/__sdp_backwards__/sdp_backwards#link-tree/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/data/users/jongsoo/fbsource/buck-out/v2/gen/fbcode/ef4169ac7f95fb74/caffe2/benchmarks/transformer/__sdp_backwards__/sdp_backwards#link-tree/caffe2/benchmarks/transformer/sdp_backwards.py", line 39, in forward
attn, _ = torch.nn.functional.scaled_dot_product_attention(
ValueError: too many values to unpack (expected 2)
```
Test Plan: buck run mode/dev-nosan //caffe2/benchmarks/transformer:sdp_backwards
Differential Revision: D45843838
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101341
Approved by: https://github.com/drisspg
Summary: This allows an internal use case to register a callback that can vary over time instead of being a static value over the lifetime of the program.
Test Plan: ran the test listed above ^^.
Differential Revision: D45805139
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101292
Approved by: https://github.com/aaronenyeshi
The linked issue demonstrates a triton bug where a load broadcasted
over multiple warps may see the result of a store that happens later
in the triton program. The workaround is to add a barrier before
storing, which enforces that all warps have already read the data.
e.g. in `test_embedding_var_mean` we now generate:
```python
tl.debug_barrier()
tl.store(in_out_ptr1 + (tl.broadcast_to(x0, [XBLOCK, 1])), tmp17, None)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100769
Approved by: https://github.com/jansel, https://github.com/ngimel
Looks like this line was a historical relic of Variable and Tensor not being the same. I spot checked assembly and it's not the same, which already implies this way is better; specifically there are fewer locked refcounting instructions (I believe the type of the expression is `Tensor` not `const Tensor&` and both forks must have the same type). Spotted this with at::cat in an internal workload; the actual fix is to enable InferenceMode but this should reduce the penalty for failing to do that.
Differential Revision: [D43714744](https://our.internmc.facebook.com/intern/diff/D43714744/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95835
Approved by: https://github.com/albanD
Hello!
Recently I was playing with the LibTorch libs and noticed that there is currently only one LR scheduler implementation available. I needed a 'reduce on plateau' scheduler, so I implemented it myself. I have used it many times and it seems to work as it should, so I decided to share my implementation here.
If you decide this is worth merging, or that it needs tweaking/tests, let me know!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100311
Approved by: https://github.com/albanD
Summary: This gives a finer control for developers to specify which set
of configs to measure for their one-off dashboard run. Right now the
queuing for those runs look pretty bad.
Another change here is reducing the inference measurement frequency to
2 times a week.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101279
Approved by: https://github.com/huydhn
I've noticed that 3-4 functions in trymerge each implement similar tail-recursive retries for flaky network failures.
Unify them using a single wrapper in `gitutils.py`.
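A minimal sketch of such a wrapper (names and retry parameters are illustrative, not the actual `gitutils.py` signature):
```python
import time
from functools import wraps
from typing import Any, Callable, TypeVar

T = TypeVar("T")

def retries_decorator(num_retries: int = 3, delay: float = 0.5) -> Callable:
    def decorator(func: Callable[..., T]) -> Callable[..., T]:
        @wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> T:
            for attempt in range(num_retries):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == num_retries - 1:
                        raise        # out of retries: surface the error
                    time.sleep(delay)
        return wrapper
    return decorator

@retries_decorator()
def fetch_url(url: str) -> str:
    ...  # flaky network call goes here
```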
<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 8d40631</samp>
> _`retries_decorator`_
> _adds resilience to GitHub scripts_
> _autumn of errors_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101227
Approved by: https://github.com/kit1980
This PR adds bazel python, so that bazel build could be used from python like `import torch`.
Notable changes:
- Add the python targets.
- Add the version.py.tpl generation.
- In order to achieve `USE_GLOBAL_DEPS = False` just for the bazel build, employ a monkey-patch hack in the mentioned `version.py.tpl`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101003
Approved by: https://github.com/huydhn
Summary: Previously, we would only match and replace conv + BN
patterns with default constant args for conv (stride, padding,
dilation etc.). If the user sets one of these args to values
that are different from the default, we would simply not fuse
the pattern. This is due to a limitation in the subgraph
rewriter: see https://github.com/pytorch/pytorch/issues/100419.
This commit works around the above limitation by first
configuring the subgraph rewriter to ignore literals when
matching, and then manually copy over the constant args to the
new subgraph after `replace_pattern`.
Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_prepare_qat_conv_bn_fusion_constant_args
Reviewers: jerryzh168, kimishpatel
Differential Revision: [D45515437](https://our.internmc.facebook.com/intern/diff/D45515437)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100525
Approved by: https://github.com/jerryzh168
To match with upstream and build triton whl's locally so nightly pytorch whls can access them without needing to use pypi.org.
We may have a better approach to build both whl's at once, but for now, to save duplication of code, another matrix is added for device (cuda/rocm), with rocm invoking a different commit and repo. The goal is to eventually have a single whl support both backends.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95142
Approved by: https://github.com/malfet, https://github.com/jithunnair-amd, https://github.com/atalman
Fixes #100935, adding handling for the recompute_scale_factor field. I would be happy to write a test for this, but might need some advice on where it should go/how to reliably reproduce the given issue. I'd also be happy to iterate on the proposed changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101248
Approved by: https://github.com/albanD
The PyTorch Dispatcher's "no kernel found for DispatchKey" error message
is a bit long and winded. This PR adds a way to add a custom error
callback and changes the CustomOp API to use the custom error callback
to deliver better error messages.
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101015
Approved by: https://github.com/ezyang
Previously, to specify e.g. int[], a user needed to do Tuple[int, ...].
This PR changes it to Sequence[int].
Bikeshedding: we could totally just use List[int] instead. The types
that the user gives us that we use to infer a schema is not entirely
faithful: for example, we convert `int` to SymInt.
I didn't feel strongly between Sequence[int] and List[int] so I went
with the more faithful one, plus Python recommends that you use Sequence
for input arguments (over list or tuple), though we don't subscribe to
that philosophy in general.
Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101190
Approved by: https://github.com/bdhirsh
This PR tells the custom op tests to destroy all custom ops with
specified namespace after each test.
The general problem is that if a test fails, the custom op isn't cleaned
up. We could fix this via try-finally, but using a tearDown method
seemed like a nice O(1) solution.
Test Plan:
- deleted some foo._destroy, verified that the test suite passes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100980
Approved by: https://github.com/soulitzer, https://github.com/bdhirsh
Previously the error message went through torch.library. This PR changes
it so that on each custom_op.impl_* call:
- we store a (function, location) tuple
- if a (function, location) tuple exists already, then we raise an
error.
This logic already existed for the abstract impl (the impl for meta and
fake tensors), so this PR just extends it to the others.
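A sketch of the bookkeeping this describes (all names here are hypothetical, not the actual CustomOp internals):
```python
import inspect

_registered_impls: dict = {}  # (op_name, kind) -> (function, "file:line")

def register_impl(op_name: str, kind: str, func) -> None:
    key = (op_name, kind)
    caller = inspect.stack()[1]
    location = f"{caller.filename}:{caller.lineno}"
    if key in _registered_impls:
        _, prev_location = _registered_impls[key]
        raise RuntimeError(
            f"impl_{kind} for '{op_name}' was already registered at {prev_location}; "
            f"attempted to register it again at {location}"
        )
    _registered_impls[key] = (func, location)
```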
Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100979
Approved by: https://github.com/bdhirsh, https://github.com/soulitzer
Notes:
- No segfaults observed in any CI tests: dynamo unittests, inductor unittests, dynamo-wrapped pytorch tests. So we remove the warning that using dynamo 3.11 may result in segfaults.
- Some dynamo-wrapped pytorch tests hang. They will be skipped in the dynamo-wrapped test suite and will be addressed in a future PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99180
Approved by: https://github.com/malfet
The fix for https://github.com/pytorch/pytorch/pull/99545 (https://github.com/pytorch/pytorch/pull/99546) explicitly required users to set `cast_forward_inputs=False` if they wanted to avoid hitting #99545 while using an FSDP root module with no direct parameters.
After further consideration, [the team believes](https://github.com/pytorch/pytorch/pull/99546#discussion_r1180898687) it is sufficiently common for the default `cast_forward_inputs=False` to be used with a FSDP root module possessing no direct parameters that a solution to #99545 that accommodates this use case is desired.
This PR builds on @zhaojuanmao's https://github.com/pytorch/pytorch/pull/100290 (nice!) to enhance the FSDP cast forward inputs testing to include a broader range of scenarios and to include `model.eval()` testing as well as training mode validation. (I unfortunately don't have permissions that would allow me to use ghstack directly but I can rebase this PR however the team desires, once #100290 lands etc.)
Currently, the evaluation mode testing is commented out while the team decides on the best approach to implementing the broader solution to https://github.com/pytorch/pytorch/pull/99545. Once an implementation is decided, the evaluation mode validation function in the new tests added in this PR can be uncommented and should continue to pass. I also include one potential evaluation mode solution suggestion in this PR but leave the existing code unchanged since I know the team is intending to consider a range of solutions this week.
Test notes:
1. The 8 tests added here are a superset of the current `test_float16_on_one_submodule` tests, including validation of the following configurations: (`cast_root_forward_inputs_submodule` = True/False, `cast_forward_inputs_submodule` = True/False, `use_root_no_params` = True/False) across both training and evaluation modes.
2. The `float16_on_one_submodule` model configuration is currently only tested in the FSDP root module with parameters scenarios (as was the existing case) but this test can be easily extended to test it in the FSDP root module with no parameters scenarios as well if the team thinks the additional test resource usage is justified.
3. Since this test amortizes the cost of test setup across the aforementioned range of scenarios, the loop-based implementation of `dtype` validation (below) would have been undesirably complex IMHO[^1] :
```python
############### Logical equivalent of current test result matrix ############
if self.cast_root_forward_inputs_submodule or self.cast_forward_inputs_submodule:
self.assertEqual(self.forward_inputs[self.c2].dtype, torch.float16)
if use_root_no_params:
if self.cast_root_forward_inputs_submodule:
self.assertEqual(self.forward_inputs[self.model].dtype, torch.float16)
else:
self.assertEqual(self.forward_inputs[self.model].dtype, torch.float32)
self.assertEqual(self.forward_inputs[self.c1].dtype, torch.float16)
else:
self.assertEqual(self.forward_inputs[self.c1].dtype, torch.float32)
else:
self.assertEqual(self.forward_inputs[self.model].dtype, torch.float32)
self.assertEqual(self.forward_inputs[self.c1].dtype, torch.float32)
if not use_root_no_params: # this input will only exist in the root with params case until eval fix is applied
self.assertEqual(self.forward_inputs[self.c2].dtype, torch.float32)
```
so I implemented the validation function as an expected result lookup that provides the added benefit of explicitly specifying the failed subtest upon failed `dtype` assertions, e.g.:
```python
AssertionError: None mismatch: torch.float32 is not None
Subtest `no_cast_root_no_cast_child_no_root_params` failed.
```
The potential solution to https://github.com/pytorch/pytorch/pull/99545 that I added as a suggestion in the file conversation passes this test set but I know there are a lot of different ways that it could be resolved so I'll assume that change will be tackled in a separate PR unless the team wants to include it in this one.
As mentioned, I've currently based this PR off of https://github.com/pytorch/pytorch/pull/100290 so am happy to either wait for that to land first or rebase this PR however the team wants.
[^1]: Batching the scenarios into different tests is also possible of course but would involve unnecessary test setup overhead, happy to switch to that approach if the team prefers that though.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100349
Approved by: https://github.com/awgu
CUDAGraph trees need to know when you are doing a new invocation of your model. We have two heuristics for that:
- you invoke torch.compile again (like as a top level module you compiled)
- you have run a forward with a corresponding backward that hasn't been invoked yet, in which case we ignore the above
This doesn't always get it right, especially if you forget to use torch.no_grad() in inference. This adds a warning for that case, and adds an explicit `cudagraph_mark_step_begin` api.
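A usage sketch of the explicit marker; the exposed namespace is assumed here to be `torch.compiler.cudagraph_mark_step_begin` (as in later releases), which may differ from where this PR first landed it:
```python
import torch

@torch.compile(mode="reduce-overhead")  # cudagraph trees path
def model(x):
    return x.sin() + x.cos()

# Requires a CUDA device. Marking each step tells cudagraph trees that a new
# invocation of the model is starting, even without a matching backward.
for _ in range(3):
    torch.compiler.cudagraph_mark_step_begin()
    with torch.no_grad():
        out = model(torch.randn(8, device="cuda"))
```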
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101129
Approved by: https://github.com/ezyang
This will solve @albertz's issue as described in #98200, threading the generator argument through the trunc_normal_ function. I'm still working on #99796 (and won't let it stall out), but this fix doesn't trigger any JIT issues, so I think it might be helpful to get it merged now.
Would be happy to iterate on this if there are any issues.
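A small usage sketch of the threaded-through argument, showing reproducible initialization with a fixed generator:
```python
import torch

gen = torch.Generator().manual_seed(0)
w1 = torch.nn.init.trunc_normal_(torch.empty(3, 3), generator=gen)

gen.manual_seed(0)
w2 = torch.nn.init.trunc_normal_(torch.empty(3, 3), generator=gen)

assert torch.equal(w1, w2)  # same seed, same generator -> same init
```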
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100810
Approved by: https://github.com/Skylion007, https://github.com/albanD
Fixes https://github.com/pytorch/pytorch/issues/100348, see the discussion in the issue for details. The problem was that for code like this:
```
def f(x):
out = ...
return out, out.detach()
```
The `.detach()` would turn into a `.alias()`, and inductor turns `.alias()` calls into no-ops. Inductor would effectively see that the two graph outputs have the same metadata, and return `out, out`. cc @ngimel alternatively we could have inductor try to detect when it's not ok to make `.alias()` a no-op, but that would probably require some custom logic instead of making `.alias()` a decomposition.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100430
Approved by: https://github.com/ngimel
This PR just contains some mild gyrations necessary to appease mypy.
However, it is not complete; there are a number of legitimate bugs
and mistyping that I need to work out before I can actually turn this
on.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100712
Approved by: https://github.com/ngimel
Enables PyLint error codes implemented in ruff. These are un-opinionated static analysis checks on Python code that finds common bugs. After running all the PLE error codes that are implemented in ruff, I fixed the bugs, added a few ignores for malformed Python code that is part of our JIT test script, and finally added a few ignores for a false positive on PLE0605 and submitted an issue upstream to fix in ruff https://github.com/charliermarsh/ruff/issues/4345 .
Common bugs found here include analysis for malformed logging format calls, bad string format calls, invalid escape sequences, and more.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101079
Approved by: https://github.com/malfet
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 069fd23</samp>
This pull request enhances the MPS implementation of random operations in `Distributions.mm` and adds more dtype tests for the bernoulli distribution in `test_mps.py`. This improves the performance, correctness, and usability of the MPS backend for PyTorch.
Fixes https://github.com/pytorch/pytorch/issues/100717
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100946
Approved by: https://github.com/kulinseth
This PR changes the context manager behavior of device mesh: we now use a mesh env to track the current mesh and save meshes to a stack so that nested context managers are allowed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101202
Approved by: https://github.com/wz337
- Deletes unused kwargs
- Make test names more descriptive to remove the need for comments. Overall it's better to codify than to comment.
- Added a test for duplicate params across groups
- Greatly simplified test_empty_grad, discovering that the crux of the bug was NOT emptiness per se, but rather multi-dim emptiness.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101004
Approved by: https://github.com/albanD
This PR adds support for the following use cases:
- Sync style:
```
with dist._coalescing_manager():
for i in range(num_coll):
dist.all_gather_into_tensor(output_tensors[i], input_tensors[i])
```
- Async style:
```
with dist._coalescing_manager(async_ops=True) as cm:
for i in range(num_coll):
dist.all_gather_into_tensor(output_tensors[i], input_tensors[i])
# do a bunch of other things
cm.wait()
# do things that depend on the all-gather's
```
Each `all_gather_into_tensor` would be independent in terms of data and their buffer location. But could be executed in parallel by supported backends (like NCCL).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101157
Approved by: https://github.com/kumpera, https://github.com/wanchaol
Beefing up docs with discussion about when to use `instantiate_device_type_tests()` vs. `instantiate_parametrized_tests()` + description on what each does.
Spoiler: use only one - the former for device-specific and the latter for device-agnostic tests. Both support `@parametrize`.
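For reference, a minimal device-agnostic example of the latter (standard usage of the existing helpers, nothing new added by this PR):
```python
from torch.testing._internal.common_utils import (
    TestCase, instantiate_parametrized_tests, parametrize, run_tests)

class TestExample(TestCase):
    @parametrize("flag", [True, False])
    def test_something(self, flag):
        self.assertIsInstance(flag, bool)

# generates parametrized variants such as test_something_flag_True
instantiate_parametrized_tests(TestExample)

if __name__ == "__main__":
    run_tests()
```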
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100905
Approved by: https://github.com/janeyx99
Summary: We don't think the performance impact of recording concrete shapes is significant; but it's good to have a knob for turning it off quickly in case it has a large performance impact.
Test Plan:
Ran D45681838. It prints the state of that "concrete inputs" boolean. I ran it before and after canarying a change to `pytorch/kineto:pytorch_record_concrete_inputs`; before, it returns true; after, it returns false.
Note that D45681838 had to add `service` on the main function. That's because we need to `initFacebook` in order to use jks.
Differential Revision: D45680162
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101043
Approved by: https://github.com/aaronenyeshi
# Motivation
Without this PR:
```python
>>>import torch
>>>torch.IntTensor.is_cuda
False
>>>torch.IntTensor.is_xpu
<attribute 'is_xpu' of 'torch._C._TensorBase' objects>
```
With this PR:
```python
>>>import torch
>>>torch.IntTensor.is_xpu
False
```
To align with CUDA: some customer code uses is_xpu to check the backend. Without this PR, the check is always truthy (it returns the attribute object), which results in unexpected behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101072
Approved by: https://github.com/mikaylagawarecki
implementation of DataPtr context for copy-on-write tensors
Summary:
Copy-on-write storage
=====================
This library adds support for copy-on-write storage, i.e. lazy copies,
to tensors. The design maintains the PyTorch invariant that tensors
alias if and only if they share a storage. Thus, tensors that are lazy
copies of one another will have distinct storages that share a data
allocation.
Thread-safety
-------------
The correctness of this design hinges on the pre-existing PyTorch user
requirement (and general default programming assumption) that users
are responsible for guaranteeing that writes do not take place
concurrently with reads and other writes.
Lazily copied tensors add a complication to this programming model
because users are not required to know if lazy copies exist and are
not required to serialize writes across lazy copies. For example: two
tensors with distinct storages that share a copy-on-write data context
may be given to different threads that may do whatever they wish to
them, and the runtime is required to guarantee its safety.
It turns out that this is not that difficult to protect because, due
to the copy-on-write requirement, we just need to materialize a tensor
upon writing. This could be done entirely without synchronization if
we materialized each copy, however, we have a common-sense
optimization to elide the copy for the last remaining reference. This
requires waiting for any pending copies.
### Thread-safety detailed design
There are two operations that affect the copy-on-write details of a
tensor:
1) lazy-clone (e.g. an explicit call or a hidden implementation detail
added through an operator like reshape)
2) materialization (i.e. any write to the tensor)
The key insight that we exploit is that lazy-clone is logically a read
operation and materialization is logically a write operation. This
means that, for a given set of tensors that share a storage, if
materialization is taking place, no other read operation, including
lazy-clone, can be concurrent with it.
However, this insight only applies within a set of tensors that share
a storage. We also have to be concerned with tensors with different
storages that share a copy-on-write context. In this world,
materialization can race with lazy-clone or even other
materializations. _However_, in order for this to be the case, there
must be _at least_ two references to the context. This means that the
context _can not_ vanish out from under you if you are performing a
lazy-clone, and hence, it only requires an atomic refcount bump.
The most complicated case is that all lazy-copies are concurrently
materializing. In this case, because a write is occurring, there are
no in-flight lazy-copies taking place. We must simply ensure that all
lazy-copies are able to materialize (read the data) concurrently. If
we didn't have the aforementioned optimization where the last copy
steals the data, we could get away with no locking whatsoever: each
makes a copy and decrements the refcount. However, because of the
optimization, we require the loser of the materializing race wait for
the pending copies to finish, and then steal the data without copying
it.
We implement this by taking a shared lock when copying the data and
taking an exclusive lock when stealing the data. The exclusive lock
acquisition ensures that all pending shared locks are finished before
we steal the data.
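A purely conceptual sketch of the materialization rule described above (a Python stand-in, not the c10 implementation; a single mutex stands in for the shared/exclusive lock):
```python
import threading

class CowContext:
    """Shared data allocation behind several lazily-copied storages."""

    def __init__(self, data):
        self.data = data
        self.refcount = 1
        self.lock = threading.Lock()

    def lazy_clone(self):
        # logically a read: just bump the refcount, no copy yet
        with self.lock:
            self.refcount += 1
        return self

    def materialize(self):
        # logically a write: the last remaining reference steals the buffer,
        # every other reference makes its own copy
        with self.lock:
            self.refcount -= 1
            if self.refcount == 0:
                return self.data
            return list(self.data)
```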
Test Plan: 100% code coverage.
---
Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/pytorch/pytorch/pull/100818).
* #100821
* #100820
* #100819
* __->__ #100818
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100818
Approved by: https://github.com/ezyang
In CI older MacOS SDK can be used to compile the binary, so add guard for availability of `MPSGraphResizeNearestRoundingModeRoundToEven` enum value.
MPS feature availability checks are deliberately done at runtime (by using `is_macos_13_or_newer` and forward-declaring methods in `MPSGraphVenturaOps.h`) rather than at compile time (by using `#ifdef`s).
Modify error message and XFAIL condition in `test_mps.py` to fail test due to missing conditional on macOS-13.2 or newer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101108
Approved by: https://github.com/kulinseth
Provide an option to configure the workspace size used by cuBLASLt rather than fixing it as a compile-constant of 1MiB due to observed performance differences on H100 and recommendations from cuBLAS e.g., https://docs.nvidia.com/cuda/archive/11.8.0/cuda-toolkit-release-notes/index.html#title-cublas-library.
Some quick profiling shows that in some cases up to 32MiB of workspace is needed on H100:
```
import torch
import time
m = 1024
n = 2048
warmup = 20
iters = 200
dtype = torch.bfloat16
for k in (1024, 2048, 4096, 8192, 9376, 16384, 32768):
a = torch.randn(m, k, device='cuda', dtype=dtype)
b = torch.randn(n, k, device='cuda', dtype=dtype).transpose(1, 0)
i = torch.randn(n, device='cuda', dtype=dtype)
for _ in range(warmup):
torch.addmm(i, a, b)
torch.cuda.synchronize()
t1 = time.perf_counter()
for _ in range(iters):
torch.addmm(i, a, b)
torch.cuda.synchronize()
t2 = time.perf_counter()
print(f"m:{m}, n:{n}, k:{k} TFLOP/s: {( 2*m*n*k)*iters/(t2 - t1)/1e12}")
```
1MiB:
```
m:1024, n:2048, k:1024 TFLOP/s: 62.40964655242158
m:1024, n:2048, k:2048 TFLOP/s: 79.33321703070685
m:1024, n:2048, k:4096 TFLOP/s: 96.69701590181765
m:1024, n:2048, k:8192 TFLOP/s: 83.2892371366678
m:1024, n:2048, k:9376 TFLOP/s: 83.91872373271516
m:1024, n:2048, k:16384 TFLOP/s: 86.57820235279185
m:1024, n:2048, k:32768 TFLOP/s: 88.37227761178467
```
32 MiB:
```
m:1024, n:2048, k:1024 TFLOP/s: 73.50633352382425
m:1024, n:2048, k:2048 TFLOP/s: 104.32016319633199
m:1024, n:2048, k:4096 TFLOP/s: 131.37290416527784
m:1024, n:2048, k:8192 TFLOP/s: 152.08780769805506
m:1024, n:2048, k:9376 TFLOP/s: 154.93898780286096
m:1024, n:2048, k:16384 TFLOP/s: 165.13973167154688
m:1024, n:2048, k:32768 TFLOP/s: 160.62065020500813
```
CC @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101145
Approved by: https://github.com/ngimel
**Summary**
Fix the issue https://github.com/pytorch/pytorch/issues/100959. The root cause: a node of `torch.ops.aten.max_pool2d_with_indices.default` has 2 outputs, the output tensor and the max indices, so its `node.meta["val"]` is a tuple of `FakeTensors` (for example: `'val': (FakeTensor(..., size=(1, 2, s1, s1)), FakeTensor(..., size=(1, 2, s1, s1), dtype=torch.int64))`). This fails the observer-insertion check, which only accepts a single `FakeTensor`.
**Test Plan**
```
python -m pytest test_quantize_pt2e.py -k test_max_pool2d_quantizer
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100961
Approved by: https://github.com/jerryzh168, https://github.com/jgong5
Previously, anomaly detection was only enabled on the inner forward function, and not on the overall joint function that calls backward. I believe this impeded us from printing "this is the forward that triggered the backward" because that printing only happens if anomaly mode is enabled when you run backward(). This PR fixes it.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101047
Approved by: https://github.com/albanD, https://github.com/bdhirsh
Per title.
there's an off chance that query_reshaped etc was actually discontiguous after reshape, but even in that case I'm pretty sure the computed gradients would still be contiguous, and we are properly transposing output gradients to produce correct strides.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101128
Approved by: https://github.com/drisspg
Fixes #99665
Let me explain the root cause using the unit test I added:
* This bug is triggered when:
  * `wrapped` is a nested function.
  * `wrapped` is in another module, different from the main function `fn`.
  * There is a graph break inside of `wrapped`.
* The root cause: when resuming the nested function, we were actually using the outermost function's (`fn` in my example) global variables, but `wrapped` calls `inner_func`, which is not part of `fn`'s globals, so we have to set the correct globals when the nested function resumes execution.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100426
Approved by: https://github.com/jansel
This is the first in a series of PRs that adopt a strategy-based approach for operator impls: each op utilizes OpStrategy and PlacementStrategy to generate its own strategy. By utilizing the strategy-based approach along with the op graph, we could enable more advanced op implementations (decomp is possible) and turn sharding prop into something closer to a constraint satisfaction problem.
This PR alone only adds some basic tensor op strategies, and it directly
works on the op graph that was used for metadata propagation. The tensor ops
added in this PR mainly follow one of the arg strategies. The next set of
PRs would add more op strategies to other ops.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100607
Approved by: https://github.com/XilunWu
Similar to ASAN, the test starts to timeout on slow jobs such as slow gradcheck, for example 30cecc0e11. This needs to be investigated later, but it's of low priority as we can run test_api binary directly in the meantime in these jobs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101088
Approved by: https://github.com/clee2000
Fixes a Meta-internal use case.
Repro:
```
import torch
import torch._dynamo
def fn(x):
with torch.cuda.amp.autocast(False):
x = torch.sin(x + 1)
return x
x = torch.randn([2, 3])
ref = fn(x)
print(ref)
opt_fn = torch._dynamo.optimize(backend="inductor")(fn)
print(opt_fn(x))
```
Error:
```
Traceback (most recent call last):
File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/convert_frame.py", line 425, in _compile
out_code = transform_code_object(code, transform)
File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/bytecode_transformation.py", line 1000, in transform_code_object
transformations(instructions, code_options)
File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/convert_frame.py", line 410, in transform
tracer.run()
File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/symbolic_convert.py", line 2010, in run
super().run()
File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/symbolic_convert.py", line 703, in run
and self.step()
File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/symbolic_convert.py", line 663, in step
getattr(self, inst.opname)(inst)
File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/symbolic_convert.py", line 385, in wrapper
return inner_fn(self, inst)
File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/symbolic_convert.py", line 1095, in CALL_FUNCTION
self.call_function(fn, args, {})
File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/symbolic_convert.py", line 554, in call_function
self.push(fn.call_function(self, args, kwargs))
File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/variables/torch.py", line 381, in call_function
return AutocastModeVariable.create(target_values=args, kwargs=kwargs)
File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/variables/ctx_manager.py", line 198, in create
bound_args = inspect.signature(torch.autocast).bind(*target_values, **kwargs)
File "/scratch/ybliang/work/env/lib/python3.9/inspect.py", line 3045, in bind
return self._bind(args, kwargs)
File "/scratch/ybliang/work/env/lib/python3.9/inspect.py", line 2984, in _bind
raise TypeError(
TypeError: multiple values for argument 'device_type'
from user code:
File "/scratch/ybliang/work/repos/debug/debug6.py", line 10, in fn
with torch.cuda.amp.autocast(False):
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101052
Approved by: https://github.com/anijain2305
Preserves the PyTest cache from one job run to the next. In a later PR, this will be used to change the order in which we actually run those tests
The process is:
1. Before running tests, check S3 to see if there is an uploaded cache from any shard of the current job
2. If there are, download them all and merge their contents. Put the merged cache in the default .pytest_cache folder
3. After running the tests, merge the now-current .pytest_cache folder with the cache previously downloaded for the current shard. This will make the merged cache contain all tests that have ever failed for the given PR in the current shard
4. Upload the resulting cache file back to S3
The S3 folder has a retention policy of 30 days, after which the uploaded cache files will get auto-deleted.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100522
Approved by: https://github.com/huydhn
Fixed test_memory_profiler::TestMemoryProfilerE2E::test_memory_timeline by changing the (arbitrary) threshold for logging. We observe differently-sized allocations on different AMD GPUs, so chose a higher threshold of 512 to account for those differences and yet satisfy the test requirements.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96752
Approved by: https://github.com/jithunnair-amd, https://github.com/kit1980
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101005
Previously the node annotation looks like the following:
```
node.meta["..."] = {
"input_act_obs_or_fq_ctr": ...,
"weight_obs_or_fq_ctr": ...,
"weight_index": 1,
}
```
Basically we needed to specify the index for the weight and also have a separate key for the weight config; in this PR we changed that to:
```
node.meta["..."] = {
"input_act_obs_or_fq_ctr_map": {input_node: ..., weight_node: ...},
}
```
This can support specifying the observer/fake quant constructor for any argument of the node
Test Plan: buck2 test @//mode/opt //caffe2/test:quantization_pt2e -- --exact 'caffe2/test:quantization_pt2e - test_resnet18_with_quantizer_api (quantization.pt2e.test_quantize_pt2e.TestQuantizePT2EModels)'
Differential Revision: D45719781
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101041
Approved by: https://github.com/andrewor14
`tempfile.mkstemp` always creates the file with 0o600 permissions, so
only the current user can access it. Instead, this salts the original
filename with the pid and thread id to avoid conflicts between
temporary files.
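A minimal sketch of the salt-and-rename approach (function names are illustrative):
```python
import os
import threading

def _salted_path(path: str) -> str:
    # pid + thread id keeps concurrent writers from clobbering each other
    return f"{path}.{os.getpid()}.{threading.get_ident()}.tmp"

def write_atomic(path: str, content: str) -> None:
    tmp = _salted_path(path)
    with open(tmp, "w") as f:
        f.write(content)
    os.replace(tmp, path)  # atomic rename into place
```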
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100870
Approved by: https://github.com/jansel
When handling custom classes from Python, it is nice to be able to specify how they are displayed to the user.
Out of the two standard functions to do this, only `__str__` could be implemented in C++. This PR adds `__repr__` to the allowlist of magic methods.
The second commit tweaks the default output of `__str__` to make it more informative, but I can remove the change if you want.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100724
Approved by: https://github.com/ezyang
Currently we print out the mismatched collectives, but it is hard to
tell what exactly the mismatch is. This diff adds functionality to detect the exact mismatch
and report it.
New error is as follows:
```
Detected mismatch between collectives on ranks. Rank 0 is running collective: CollectiveFingerPrint(SequenceNumber=1151423632, OpType=ALLREDUCE, TensorShape=[20, 10], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cpu, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))), but Rank 1 is running collective: CollectiveFingerPrint(SequenceNumber=1151423632, OpType=REDUCE, TensorShape=[20, 10], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cpu, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))). Collectives differ in the following aspects: Op type: ALLREDUCEvs REDUCE
```
i.e. the "Collectives differ in the following..." messaging is added.
Differential Revision: [D45375737](https://our.internmc.facebook.com/intern/diff/D45375737/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100214
Approved by: https://github.com/H-Huang
Summary:
Previously, we were replacing all getitems of a split - even the ones not affected by the pattern. For large split nodes, this was inefficient.
For instance, on an internal ads model the split-split pass took ~1100s. This is down to ~18s after this optimization.
Test Plan:
* Compiled and tested on internal model (compilation time down by ~1100s)
* CI tests
Differential Revision: D45698034
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100983
Approved by: https://github.com/jansel
Adding to the docs for now, hopefully we can move to `cudaMallocAsync`-backed cuBLAS workspaces soon which should alleviate the recent confusion around `cuBLAS` "leaking" memory through workspaces.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100919
Approved by: https://github.com/ngimel
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 4f0b524</samp>
This pull request updates the codebase and the documentation to use C++17 instead of C++14 as the minimum required C++ standard. This affects the `ATen`, `c10`, and `torch` libraries and their dependencies, as well as the CI system and the `conda` package metadata.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100557
Approved by: https://github.com/malfet
`getpass.getuser` may raise exceptions in some circumstances, and because the default cache dir was assembled eagerly, users could not work around that by overriding it with the env var `TORCHINDUCTOR_CACHE_DIR`. Hence the assembly of the default cache dir should be lazily evaluated.
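A sketch of the lazy evaluation (simplified relative to the actual inductor `cache_dir()` helper):
```python
import getpass
import os
import tempfile
from functools import lru_cache

@lru_cache(None)
def default_cache_dir() -> str:
    # The env override is honored without ever calling getpass.getuser(),
    # and the user lookup only happens on first use, not at import time.
    override = os.environ.get("TORCHINDUCTOR_CACHE_DIR")
    if override is not None:
        return override
    return os.path.join(tempfile.gettempdir(), f"torchinductor_{getpass.getuser()}")
```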
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100824
Approved by: https://github.com/ezyang
cudaGetLastError and hipGetLastError will clear any error value within CUDA and HIP, respectively. This is often done on purpose to clear benign errors. Discarding the return value should be indicated by casting to void and a nearby comment. This silences warnings from HIP:
warning: ignoring return value of function declared with 'nodiscard' attribute [-Wunused-result]
An audit of PyTorch sources found one use of cudaGetLastError that was incorrectly ignored in IndexKernel.cu.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100488
Approved by: https://github.com/ezyang
Summary:
Previously the node annotation looks like the following:
```
node.meta["..."] = {
"input_act_obs_or_fq_ctr": ...,
"weight_obs_or_fq_ctr": ...,
"weight_index": 1,
}
```
Basically we needed to specify the index for the weight and also have a separate key for the weight config; in this PR we changed that to:
```
node.meta["..."] = {
"input_act_obs_or_fq_ctr_map": {input_node: ..., weight_node: ...},
}
```
This can support specifying the observer/fake quant constructor for any argument of the node
Test Plan: buck2 test @//mode/opt //caffe2/test:quantization_pt2e -- --exact 'caffe2/test:quantization_pt2e - test_resnet18_with_quantizer_api (quantization.pt2e.test_quantize_pt2e.TestQuantizePT2EModels)'
Reviewed By: kimishpatel
Differential Revision: D45553195
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101005
Approved by: https://github.com/kimishpatel
Fixes the error:
```
/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py:6021: PytestCollectionWarning: cannot collect test class 'TestFailure' because it has a __init__ constructor (from: test/inductor/test_torchinductor.py)
class TestFailure:
```
It does so by marking the class as not actually being a test class, despite its name starting with `Test`.
For more details see: https://stackoverflow.com/a/72465142/21539
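The standard pytest idiom for this is to set `__test__ = False` on the class; a minimal sketch:
```python
class TestFailure:
    # Tell pytest not to collect this helper class even though its name
    # starts with "Test"; it is plain test metadata with an __init__.
    __test__ = False
```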
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100949
Approved by: https://github.com/huydhn
Dynamo will frequently segfault when attempting to print stack traces. We fix this by:
- Fixing stack size calculations, as we did not account for exception tables
- Creating shadow execution frames in a way that more closely resembles what CPython does to create its execution frames
Dynamo/inductor-wrapped pytorch tests are enabled up the stack - those need to be green before this PR can be merged.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99934
Approved by: https://github.com/albanD, https://github.com/malfet, https://github.com/jansel
Summary:
For each op, we have a List[List[dtype;dim-order]]:
- the inner list contains the `dtype;dim-order` info for each arg if we have a Tensor/TensorList/OptionalTensorList
- the outer list contains different occurrences of dtype/dim-order combinations for that op in the program
Example:
```
et_kernel_metadata:
aten::add.out:
# A list of different dtype/dim-order combinations used in model
- # Each contains the list of args of Tensor dtype and dim order if applicable
- FLOAT;0,1
- FLOAT;0,1
- NON_TENSOR_ARG
- FLOAT;0,1
- FLOAT;0,1
-
- INT;0,1
- INT;0,1
- NON_TENSOR_ARG
- INT;0,1
- INT;0,1
aten::mul.out:
- - FLOAT;0,1
- FLOAT;0,1
- FLOAT;0,1
- FLOAT;0,1
```
We don't have the arg name so far; we need to parse the schema (functions.yaml) to get that info. We depend on the order of args from that file.
Test Plan: `buck run fbcode//executorch/codegen/tools:test_gen_oplist_real_model`
Differential Revision: D45551409
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100665
Approved by: https://github.com/larryliu0820
After https://github.com/pytorch/pytorch/pull/99559, we can now run C++ tests with `run_test.py`. Although advanced features such as `--import-slow-tests` and `--import-disabled-tests` won't work for now, there will still be a gain in reliability and performance as C++ tests can now be retried and run in parallel.
This covers all C++ tests in the CI including aten, libtorch, and Vulkan C++ tests across all platforms Linux, Windows, MacOS.
Notes:
* To support C++ test discovery, the env variable `CPP_TESTS_DIR` can be set to where the C++ test binaries is located
* Support pytest -k argument via run_test as this is used by pytest-cpp to replace `--gtest-filter`
* The XML output is in pytest format, but it's ok now because we don't have slow test or flaky test support for C++ test yet
* ~~I need to figure out why conftest.py doesn't work when I invoke pytest directly for C++ test, so `--sc` is not available for C++ tests at the moment. Proper pytest plugin like stepwise works fine though. I'll investigate and fix it in a separate PR~~ Found the cause, `conftest.py` is per directory and needs to be in any arbitrary directory that holds C++ test
* Two tests, `test_api` and `test_tensorexpr`, timed out on ASAN; I suspect that ASAN is now used on top of the python executable, which is slower than running native C++ code. IMO, it's ok to run these tests as before on ASAN for now
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99956
Approved by: https://github.com/clee2000, https://github.com/ZainRizvi
Summary:
This fixes flakiness of div_to_scalar_wrapped
See [here](b89f74aa35) for flakiness of div_to_scalar_wrapped
Test Plan:
On Devserver:
```
LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run //xplat/caffe2:pt_vulkan_api_test_bin
```
On Mac:
```
buck run --target-platforms ovr_config//platform/macos:arm64-fbsource -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64
```
To test that these changes fixed flakiness of div_to_scalar_wrapped, I ran the test 1000 times on devserver before the changes, and observed failures. Then ran it 1000 times after the changes and didn't observe any failures.
Reviewed By: SS-JIA
Differential Revision: D45670642
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100909
Approved by: https://github.com/SS-JIA
Summary: This tests running a conv2d with clamp after dividing the input tensor by another tensor. Both tensors have number of channels = 3 (i.e. not a multiple of 4) and therefore the channel dimension was padded. Hence, we are testing our divide-by-zero fix (D44392406).
Test Plan:
```
buck run --target-platforms ovr_config//platform/macos:arm64-fbsource -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -- --gtest_filter="VulkanAPITest.conv2d_clamp_after_div"
```
Reviewed By: SS-JIA
Differential Revision: D44550026
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100910
Approved by: https://github.com/SS-JIA
Summary:
This PR adds support for folding bn weights into conv for QAT flow, this is equivalent
to the QAT branch of `from_float` in eager mode quantized conv module: https://github.com/pytorch/pytorch/blob/main/torch/ao/nn/quantized/modules/conv.py#L223
Items that needs followup:
* there are some workarounds I did because quantize_per_tensor uses float/int args and dynamo does not support these args; this needs to be fixed after we change the quantized model representation and also change these args to Tensor
Test Plan: buck2 test @//mode/opt //caffe2/test:quantization_pt2e -- --exact 'caffe2/test:quantization_pt2e - test_convert_qat_conv_bn_fusion (quantization.pt2e.test_quantize_pt2e.TestQuantizePT2E)'
Reviewed By: andrewor14
Differential Revision: D45344281
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100442
Approved by: https://github.com/kimishpatel
PyTorch is a C++17 project, so let's use some C++17 features.
I.e. `s/std::is_same<X, Y>::value/std::is_same_v<X, Y>`
And use `if constexpr` in a few places where this construct applies.
<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 7b7683f</samp>
> _We're sailing on the sea of code, we're making it more neat_
> _We're using `is_same_v` and `if constexpr` to keep it sweet_
> _We're refactoring the range tensor logic, we're avoiding duplication_
> _We're heaving on the ropes of `Distributions.mm`, on the count of three, with elation_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100975
Approved by: https://github.com/jeanschmidt, https://github.com/albanD, https://github.com/kulinseth, https://github.com/Skylion007
Without these changes, it can be hard to know which magic methods are not implemented on a given ScriptObject.
before:
```py
torch.ops.load_library("somelib.so")
c = torch.classes.somelib.SomeClass()
print(len(c))
# raise NotImplementedError
```
after:
```py
torch.ops.load_library("somelib.so")
c = torch.classes.somelib.SomeClass()
print(len(c))
# raise NotImplementedError: '__len__' is not implemented for __torch__.torch.classes.somelib.SomeClass
```
------
I could not find a linked issue; if you want me to open one as well, I can do that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100171
Approved by: https://github.com/ezyang
Summary:
Currently there are build configs where the torchdynamo import trips over a
strange SystemError related to some module's __dict__.items() returning NULL,
while torchdynamo tries to iterate all torch modules and process them for
its allowed functions list.
While this is hard to repro, we should be able to work around it and then fix
it properly.
Test Plan: Rely on others to test this, assuming CI passes.
Reviewed By: anijain2305
Differential Revision: D45663313
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100901
Approved by: https://github.com/yanboliang, https://github.com/malfet
### Description
This PR fixes #99413, which shows the limitation of double backward using oneDNN in LSTM.
This PR does not implement the double backward function itself, because that is pretty hard to spell out. Instead, it implements mkldnn_rnn_layer_backward using differentiable operations, so that double backward can be done automatically.
During the backward process, we need the gates and hidden states between cells within one layer. However, these intermediate variables are stored in the `workspace` and are hard to recover from it. Therefore, in backward, we re-calculate them first.
A corresponding UT has been added based on the failing case in #99413. The UT with gradcheck and gradgradcheck added in https://github.com/pytorch/pytorch/pull/26660 cannot test LSTM using oneDNN, because that UT only supports the `double` datatype, while oneDNN does not support it.
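As a minimal sketch (not part of this PR; illustrative sizes, and the oneDNN kernel is only picked on supported CPU configurations), this is the general double-backward pattern the fix enables:
```python
import torch

lstm = torch.nn.LSTM(input_size=4, hidden_size=4, num_layers=1)
x = torch.randn(5, 2, 4, requires_grad=True)
out, _ = lstm(x)
# first-order gradient w.r.t. the input, kept differentiable
(grad_x,) = torch.autograd.grad(out.sum(), x, create_graph=True)
# second-order gradient w.r.t. the input; this is the path that used to fail
(grad_grad_x,) = torch.autograd.grad(grad_x.sum(), x)
```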
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100627
Approved by: https://github.com/jgong5, https://github.com/soulitzer
Description:
Context: In torchvision we ensure that functional ops are torchscriptable. Recently exposed `torch.backends.cpu.get_cpu_capability()` in https://github.com/pytorch/pytorch/pull/100164 is failing in torchvision CI
```
RuntimeError:
Python builtin <built-in function _get_cpu_capability> is currently not supported in Torchscript:
File "/usr/local/lib/python3.10/dist-packages/torch/backends/cpu/__init__.py", line 17
- "AVX512"
"""
return torch._C._get_cpu_capability()
~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
```
Ref: https://github.com/pytorch/vision/pull/7557
In this PR, `torch._C._get_cpu_capability()` is explicitly registered for JIT and tested.
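As a rough sketch of the kind of call this registration unblocks (the wrapper function name here is made up; the scripted path mirrors the torchvision usage above):
```python
import torch

@torch.jit.script
def pick_resize_path() -> str:
    # scripting this body requires torch._C._get_cpu_capability to be registered for JIT
    return torch.backends.cpu.get_cpu_capability()

print(pick_resize_path())  # e.g. "AVX2"
```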
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100723
Approved by: https://github.com/albanD
This PR does the following:
1. Previously, inline constraints were not properly set for tensor-output data-dependent ops such as a.nonzero, because the return value is not a SymInt. This PR just uses all the unbacked symbols, i.e. those starting with "i"/"f", in the create_unbacked_sym* functions. Note that these symbols are guaranteed to be a superset of the inline user constraints.
2. Adds support for inline assertions by checking them at runtime.
Currently, it only deals with tensor, SymInt, SymFloat, and SymBool output data-dependent ops and ignores the rest. That's good enough for now, as we only have a limited number of data-dependent ops (.item and .nonzero are explicitly tested).
An example of a graph with added assertions is shown below:
```
class ExportGraphModule(torch.nn.Module):
    def forward(self, x):
        arg0: i64[s0], = fx_pytree.tree_flatten_spec(([x], {}), self._in_spec)
        nonzero_default: i64[i0, 1] = torch.ops.aten.nonzero.default(arg0); arg0 = None
        return pytree.tree_unflatten([nonzero_default], self._out_spec)

class GraphModule(torch.nn.Module):
    def forward(self, x):
        arg0: i64[s0], = fx_pytree.tree_flatten_spec(([x], {}), self._in_spec)
        sym_size: Sym(s0) = torch.ops.aten.sym_size(arg0, 0)
        nonzero_default: i64[i1, 1] = torch.ops.aten.nonzero.default(arg0); arg0 = None
        sym_size_1: Sym(i1) = torch.ops.aten.sym_size(nonzero_default, 0)
        ge: Sym(i1 >= 3) = sym_size_1 >= 3
        scalar_tensor_default: f32[] = torch.ops.aten.scalar_tensor.default(ge); ge = None
        _assert_async_msg = torch.ops.aten._assert_async.msg(scalar_tensor_default, 'nonzero_default.shape[0] is outside of inline constraint [3, 5].'); scalar_tensor_default = None
        le: Sym(i1 <= 5) = sym_size_1 <= 5; sym_size_1 = None
        scalar_tensor_default_1: f32[] = torch.ops.aten.scalar_tensor.default(le); le = None
        _assert_async_msg_1 = torch.ops.aten._assert_async.msg(scalar_tensor_default_1, 'nonzero_default.shape[0] is outside of inline constraint [3, 5].'); scalar_tensor_default_1 = None
        return pytree.tree_unflatten([nonzero_default], self._out_spec)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100763
Approved by: https://github.com/tugsbayasgalan
Summary: `CUDACachingAllocator::format_size` is used not only in CUDACachingAllocator.cpp but also in CUDAMallocAsyncAllocator.cpp. This caused a breakage when the compiler inlined the function and the linker couldn't find it when resolving symbols for CUDAMallocAsyncAllocator.cpp.
Differential Revision: D45612790
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100734
Approved by: https://github.com/interwq, https://github.com/kit1980
Summary: Make it possible to `torch.jit.load(model, device)` to a device when `model` contains weights that are on device `meta`. Just leave the `meta` weights on `meta`, and load the weights that can be loaded to the target device.
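A minimal usage sketch of what this enables (the file name is hypothetical):
```python
import torch

# hypothetical archive whose state mixes "meta" and real weights;
# weights on "meta" stay on "meta", everything else is loaded to the target device
model = torch.jit.load("model_with_meta_weights.pt", map_location="cuda")
```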
Reviewed By: singlaiiit, RoshanPAN, sayitmemory
Differential Revision: D45099145
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100495
Approved by: https://github.com/houseroad
I ported over the code for the inline interpreter incorrectly in the pass base 😅
Originally the function `make_inline_interpreter` is supposed to take in an fx.Interpreter type, but I accidentally passed in an fx.Interpreter object. While modifying this diff (and from comments from Tugsuu), I also realized that we don't really need this InlineInterpreter.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100836
Approved by: https://github.com/zhxchen17, https://github.com/tugsbayasgalan
Currently, when f is a Module, the signature should be the "forward" method's signature. For example,
```python
class Module(torch.nn.Module):
    def forward(self, x):
        return x.sin()

mod = Module()
x = torch.ones([3, 3])
torch._dynamo.export(mod, x, constraints=[dynamic_dim(x, 0)])
```
Previously, it printed the following:
```python
def specify_constraints(*args, **kwargs):
    return [
        2 <= dynamic_dim(x, 0),
        2 <= dynamic_dim(x, 1),
    ]
```
After this PR, it prints:
```python
def specify_constraints(x):
    return [
        2 <= dynamic_dim(x, 0),
        2 <= dynamic_dim(x, 1),
    ]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100739
Approved by: https://github.com/avikchaudhuri
Today, we prioritize running test files that were edited in the user's PR, with the idea being to run them before we run any other test.
Except, if the modified test is supposed to run serially, then we still end up running it after all the parallelized tests have finished running.
This PR fixes that to _always_ run the prioritized tests before the regular tests, regardless of whether the test is supposed to run serially or in parallel.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100748
Approved by: https://github.com/huydhn
Cruise uses [clang static analyzer](https://clang-analyzer.llvm.org/) internally.
In the v2.0.0 release of PyTorch it found this problem
```
In file included from external/pytorch/aten/src/ATen/ATen.h:7:
In file included from external/pytorch/aten/src/ATen/Context.h:3:
In file included from external/pytorch/aten/src/ATen/CPUGeneratorImpl.h:3:
In file included from external/pytorch/aten/src/ATen/core/Generator.h:22:
In file included from external/pytorch/c10/core/GeneratorImpl.h:8:
In file included from external/pytorch/c10/core/TensorImpl.h:6:
external/pytorch/c10/core/InferenceMode.h:58:5: warning: Passed-by-value struct argument contains uninitialized data (e.g., field: 'view_replay_enabled_')
AutogradState::set_tls_state(AutogradState(
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 warning generated.
```
In other words, the value of `view_replay_enabled_` could be uninitialized, which may lead to subtle bugs later on.
This PR addresses the warning by explicitly initializing it to `false`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100822
Approved by: https://github.com/Skylion007
This PR adds a placeholder handler for a new param being passed in from Inductor, `enable_log`.
Fixes the error below, which has prevented me from running torch.compile on NanoGPT:
~~~
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/_inductor/fx_passes/fuse_attention.py", line 219, in _sfdp_init
register_replacement(
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/_inductor/pattern_matcher.py", line 658, in register_replacement
search_gm = trace_fn(search_fn, example_inputs)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/_inductor/pattern_matcher.py", line 828, in training_graph
aot_function(
torch._dynamo.exc.BackendCompilerFailed: backend='compile_fn' raised:
TypeError: patched_aot_function() got an unexpected keyword argument 'enable_log'
~~~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100814
Approved by: https://github.com/fegin
Summary: Disable buffer sync in _sync_module_states(...) when broadcast_buffers is False. This change will reduce memory usage when a model has huge buffers and does not need to broadcast them.
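For context, this is the flag users set to opt out of the sync; a minimal sketch (assumes a process group has already been initialized):
```python
import torch
import torch.nn as nn

# a minimal sketch; assumes torch.distributed.init_process_group() has already run
model = nn.Linear(1024, 1024).cuda()
ddp_model = nn.parallel.DistributedDataParallel(model, broadcast_buffers=False)
```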
Test Plan: .
Differential Revision: D45610709
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100729
Approved by: https://github.com/mrshenli
Prevent using parallel computation when deterministic algorithms are enabled.
Fixes #97574
Benchmark:
```
[--------------- index_put_ Deterministic Algorithm Enabled ---------------]
| cpu | mps
1 threads: -----------------------------------------------------------------
Dtype: torch.float32 Features: 1024; Num Indices: 512 | 37 | 49
Dtype: torch.float32 Features: 1024; Num Indices: 1024 | 54 | 50
Dtype: torch.float32 Features: 1024; Num Indices: 2048 | 86 | 50
Dtype: torch.float32 Features: 1024; Num Indices: 4096 | 150 | 49
Times are in microseconds (us).
[-------------- index_put_ Deterministic Algorithm Disabled ---------------]
| cpu | mps
1 threads: -----------------------------------------------------------------
DType: torch.float32 Features: 1024; Num Indices: 512 | 37 | 49
DType: torch.float32 Features: 1024; Num Indices: 1024 | 53 | 49
DType: torch.float32 Features: 1024; Num Indices: 2048 | 86 | 49
DType: torch.float32 Features: 1024; Num Indices: 4096 | 147 | 50
Times are in microseconds (us).
```
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at ebf2ff3</samp>
Added a deterministic version of `index_put` for MPS tensors that runs on a single thread and can be enabled by a global context flag. Refactored the existing `index_put` function and the kernel selection logic to support both parallel and serial modes. Added a test function to verify the deterministic behavior of `index_put` under different conditions.
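A minimal sketch of the behavior being benchmarked (requires an MPS-capable machine; with deterministic algorithms enabled, the serial kernel is selected):
```python
import torch

torch.use_deterministic_algorithms(True)
t = torch.zeros(1024, device="mps")
idx = torch.randint(0, 1024, (512,), device="mps")
vals = torch.randn(512, device="mps")
t.index_put_((idx,), vals, accumulate=True)
```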
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97660
Approved by: https://github.com/kulinseth
Fixes #100530
When indices for an indirect read are computed rather than read from another tensor, they should be masked according to the index used in the computation. Currently, though, we don't associate masks with index variables, so the computed indices also don't have associated masks. This PR associates masks with index variables when they are created.
With this PR, both the device assert and the masked load are generated; hopefully the device assert can be removed later once your value analysis PR lands.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100816
Approved by: https://github.com/Chillee, https://github.com/lezcano
To make TP more generic for the Attention module, we come up with this new col/rowwise parallel style.
Basically, the idea behind it is:
We only perform DTensor ops for the col/rowwise sharded part. For the rest of the ATen ops, we leave them as plain Tensor ops.
We set this behavior as the default for the Colwise and Rowwise parallel styles. If people want to customize it, they can always pass in a different prepare_input or prepare_output.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100508
Approved by: https://github.com/wanchaol
1. Move constraint violation error after constraint discovery warning, and attach them when we have both.
2. Remove verbose internal traceback for relevant guard in constraint violation error.
3. Remove mention of `assume_static_by_default` in specialization warning.
4. Fix indenting of `specializations` body and make it assert individually instead of returning a conjunction.
5. Remove return annotation on signature used in generated `specializations` and `specify_constraints` functions.
6. Split `&` ranges because we don't support them yet.
Differential Revision: [D45619852](https://our.internmc.facebook.com/intern/diff/D45619852/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100745
Approved by: https://github.com/tugsbayasgalan
This PR refactors how InputAdapter and OutputAdapter are used throughout the exporter.
During refactoring, API issues with passes (torch.onnx._internal.fx._pass.Transform) were identified and should be tackled separately. In short, some passes can modify the input/output of the model, and the input/output adapters must be in sync with such changes; otherwise, the adapters will not reflect the actual model input/output. The first instance of this issue was with the `ReplaceGetAttrWithPlaceholder` pass, which adds new inputs to the model. In order to work this around, a new input adapt step to append the new inputs (generated by the pass) was introduced. That resulted in the number of inputs of the ONNX model mismatching the number of inputs of the PyTorch model, though.
Follow up on https://github.com/pytorch/pytorch/pull/98421
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100490
Approved by: https://github.com/BowenBao
Fixes #99665
Let me explain the root cause using the unit test I added:
* This bug is triggered when:
* ```wrapped``` is a nested function.
* ```wrapped``` is in another module which is different from the main function ```fn```.
* There is a graph break inside of ```wrapped```.
* The root cause: when resuming a nested function, we were actually using the outermost function's (```fn``` in my example) global variables, but ```wrapped``` calls ```inner_func```, which is not part of ```fn```'s globals, so we have to set the correct globals when the nested function resumes execution.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100426
Approved by: https://github.com/jansel
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 8bb6158</samp>
This pull request adds forward and backward AD support for the `logcumsumexp` operator in functorch, a library for composable function transformations. It implements a forward-mode formula and a decomposition in `derivatives.yaml`, a C++ function for computing directional derivatives in `FunctionsManual.cpp`, and updates the tests and metadata in `test_ops.py` and `common_methods_invocations.py`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100629
Approved by: https://github.com/soulitzer
Summary:
Staging an update to the latest fmt version triggered lots of build errors due to non-`const` methods on custom formatters. This fixes the `format()` methods to be `const` as they don't mutate any state anyway, as well as `parse()` methods that don't need to mutate internal state. This mitigates many future build errors.
Updates were identified and executed by using regular expression search/replacements such as:
`(constexpr auto parse\(ParseContext& [^)]*\)) \{` -> `$1 const {`
`(constexpr auto parse\(ParseContext& [^)]*\)) ->` -> `$1 const ->`
`(auto format\(.*, FormatContext& [^)]*\)) \{` -> `$1 const {`
`(auto format\(.*, FormatContext& [^)]*\)) ->` -> `$1 const ->`
Any changes to third-party code was then reverted. Some small changes detected from subsequent build errors were then applied.
Test Plan: CI
Differential Revision: D45463620
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100616
Approved by: https://github.com/davidberard98
This PR conditionally inserts a cast operator after a reduction operation to match the specified dtype in the exported ONNX model. The code changes affect **opset9** and **opset13**.
I understand there's an [automatic upcast to int64](c91a41fd68/torch/onnx/symbolic_opset9.py (L783)) before reduction most likely to prevent overflow so I left that alone and only conditionally add casting back to desired dtype.
## Test int32
```
import torch
import onnx
a = torch.tensor([10, 20, 30, 80], dtype=torch.int32)
def test():
    class SumInt32(torch.nn.Module):
        def forward(self, a):
            return torch.sum(a, dtype=torch.int32)

    sumi = SumInt32().eval()
    assert sumi(a).dtype == torch.int32
    print("Torch model output type matches input type")
    torch.onnx.export(sumi, (a), "/tmp/sumi_int32.onnx", opset_version=12)
    model = onnx.load("/tmp/sumi_int32.onnx")
    assert model.graph.output[0].type.tensor_type.elem_type == onnx.TensorProto.INT32
    print("ONNX model output type matches input type")

test()
```

## Test int64
```
import onnx
import torch
a = torch.tensor([10, 20, 30, 80], dtype=torch.int64)
def test():
    class SumInt64(torch.nn.Module):
        def forward(self, a):
            return torch.sum(a, dtype=torch.int64)

    sumi = SumInt64().eval()
    assert sumi(a).dtype == torch.int64
    print("Torch model output type matches input type")
    torch.onnx.export(sumi, (a), "/tmp/sumi_int64.onnx", opset_version=12)
    model = onnx.load("/tmp/sumi_int64.onnx")
    assert model.graph.output[0].type.tensor_type.elem_type == onnx.TensorProto.INT64
    print("ONNX model output type matches input type")

test()
```

Fixes https://github.com/pytorch/pytorch/issues/100097
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100700
Approved by: https://github.com/thiagocrepaldi
With the old partitioner, suppose `add` is supported, the following code
```python
def fn(a, b, c, d):
    x = a + b  # add
    y = c + d  # add_1
    return (x, y)
traced = symbolic_trace(fn)
partitioner = CapabilityBasedPartitioner(traced, supported_ops, allows_single_node_partition=True)
partitions = partitioner.propose_partitions()
```
results in the partitions `[[add], [add_1]]`. However, since these two partitions do not depend on each other, they can be aggressively merged into a single partition `[[add, add_1]]` without causing any issues. This PR introduces a new feature that allows such aggressive merging by introducing an option `aggressive_merge` to the Partitioner class.
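A hypothetical usage sketch of the new option (the exact keyword placement may differ from the final API; `traced` and `supported_ops` are reused from the snippet above):
```python
partitioner = CapabilityBasedPartitioner(
    traced,
    supported_ops,
    allows_single_node_partition=True,
    aggressive_merge=True,  # merge independent partitions such as [add] and [add_1]
)
partitions = partitioner.propose_partitions()  # expected: [[add, add_1]]
```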
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100195
Approved by: https://github.com/SherlockNoMad
I think get_reordered_tests has been broken since the master -> main switch
add typing for some functions
checked for `prioritized` in the logs
limited testing because I only care about one very small part of the log that's near the beginning
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100752
Approved by: https://github.com/huydhn
This diff adds support for dynamic equality constraints of the form `dynamic_dim(x, 0) == dynamic_dim(y, 1)`. The process of constraint discovery can already understand equality guards between dimensions and suggests such equality constraints, so this closes the loop on that. Correspondingly we now raise `ConstraintViolation` when we find that such a guard is added on a dynamic dimension and the user did not specify such a constraint. (NOTE: This is distinct from a dynamic dimension being guarded equal to a constant, which is already an error.)
Differential Revision: [D45279437](https://our.internmc.facebook.com/intern/diff/D45279437/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99993
Approved by: https://github.com/tugsbayasgalan
There are known issues with profiling cuda graphs - particularly, if you create a cuda graph before the first use of the profiler, and then run that cuda graph during profiling.
One workaround is to add `with profile(): pass` before creating the cuda graph that you want to profile later.
For convenience, we provide a function that applies this workaround. This also adds a test for the workaround, to ensure that it continues working.
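A minimal, illustrative sketch of the workaround described above (the convenience function added by this PR wraps the same idea):
```python
import torch
from torch.profiler import profile

# warm up the profiler once before capturing, so later profiling of graph replays works
with profile():
    pass

g = torch.cuda.CUDAGraph()
x = torch.randn(8, device="cuda")
with torch.cuda.graph(g):
    y = x * 2

with profile() as prof:
    g.replay()
```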
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100441
Approved by: https://github.com/Chillee, https://github.com/aaronenyeshi
This PR brings some updates and fixes with regard to PyT2.0 functionality
1 - ROCm's version of triton does not yet support tl.reduce
Until it is supported, we are opting to revert the removal of the aten.prod make_fallback for ROCm brought in with 7a6c650b81
This issue was found locally with the latest aten.prod UTs on ROCm
```
FAILED [0.0916s] inductor/test_torchinductor.py::CudaTests::test_prod_cuda - torch._dynamo.exc.BackendCompilerFailed: backend='compile_fx_wrapper' raised:
AttributeError: module 'triton.language' has no attribute 'reduce'
```
2 - Adds aten.miopen_batch_norm as an explicit fallback as perf issues are observed when registered as a decomposition, setting warning=False as the fallback is expected
3 - Fixes a typo and redundant assignment in _inductor/triton_heuristics.py brought in with dd778a7610
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100089
Approved by: https://github.com/kit1980, https://github.com/pruthvistony, https://github.com/jithunnair-amd, https://github.com/malfet, https://github.com/jansel
use const_ and mutable_ data_ptr for much of torch/csrc/jit/runtime/static/ops.cpp
Summary:
We can't address the TEWrapper cases yet because it erases all
arguments to mutable void*.
Test Plan: Rely on CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100678
Approved by: https://github.com/ezyang
add a cast function that suppresses -Wcast-function-type-strict
Summary:
These casts are a necessary evil due to the design of Python. Python
ultimately casts it back to the original type based on the flags
specified in the PyMethodDef.
Nevertheless, the new Clang flag -Wcast-function-type-strict breaks
with this.
Test Plan: Passes builds with Clang 16.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100170
Approved by: https://github.com/ezyang
Summary:
Issue:
`torch._dynamo.exc.Unsupported: call_method ListVariable() copy [] {}`
Fix:
Add `copy()` to "method_call" in _dynamo/variables/lists.py
Take it over from #98184. To unblock a meta internal model onboarding to ExecuTorch.
Differential Revision: D45592416
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100669
Approved by: https://github.com/jansel
DataLoader supports batched loading from Mapped Datasets.
This is the fetcher's implementation of auto-detection of batch loading support.
torch.utils.data._utils.fetch._MapDatasetFetcher
```
class _MapDatasetFetcher(_BaseDatasetFetcher):
    def fetch(self, possibly_batched_index):
        if self.auto_collation:
            if hasattr(self.dataset, "__getitems__") and self.dataset.__getitems__:
                data = self.dataset.__getitems__(possibly_batched_index)
            else:
                data = [self.dataset[idx] for idx in possibly_batched_index]
```
Description of Dataset API now shows this feature.
Additionally, Subset dataset now supports `__getitems__` if parent dataset supports it.
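A minimal sketch of a map-style dataset opting into the batched path shown above:
```python
from torch.utils.data import DataLoader, Dataset

class SquaresDataset(Dataset):
    def __init__(self, n):
        self.values = [i * i for i in range(n)]

    def __len__(self):
        return len(self.values)

    def __getitem__(self, idx):
        return self.values[idx]

    # batched access used by _MapDatasetFetcher when auto_collation is on
    def __getitems__(self, indices):
        return [self.values[i] for i in indices]

loader = DataLoader(SquaresDataset(10), batch_size=4)
for batch in loader:
    print(batch)
```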
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100375
Approved by: https://github.com/ejguan, https://github.com/NivekT
We do it by making it possible to register multiple tensors for the same
worker and coordinate waiting/cleanup among them.
This ensures that waiting on any number of the output tensors will result in a
single stream sync. This simplifies codegen by Inductor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99763
Approved by: https://github.com/wanchaol
**Summary**
Lowering of [`max_pool2d` ](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/lowering.py#L2732) checks the `num_reads` of the input `StorageBox.data`. When the number of reads is larger than 1, the input `StorageBox` invokes `realize`, which breaks the loop fusion with the previous node. The previous node could be `decomposed.dequant_per_tensor.tensor` in the quantization use case. `decomposed.dequant_per_tensor.tensor` has 3 reads, but 2 of those 3 reads are scalar tensors (`zero point` and `scale`). In this PR, we relax the criterion for `StorageBox.realize`: when the input is an instance of `Pointwise`, we also count the number of non-scalar tensor reads, and only invoke `StorageBox.realize` when that count is also larger than 1. This enables loop fusion and vectorized code generation for the pattern `decomposed.dequant_per_tensor.tensor - max_pool2d`.
**Test Plan**
```
cd test/inductor && python -m pytest test_cpu_repro.py -k test_dequant_maxpool2d_lowering
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99132
Approved by: https://github.com/jgong5, https://github.com/jansel
When minifying extremely large repros, the minifier can run out of memory. This is because, for delta debugging, the minifier keeps a copy of every intermediate output in the network. This can easily put you over the memory limit for your GPU. To make matters worse, we cannot easily delta debug in such a situation, as delta debugging involves replacing intermediates with inputs, but doing so can cause an intermediate to become live longer than its actual extent in the original model (since inputs all have to be allocated up front).
The strategy in this PR is to use `load_tensor` from the previous PR to offer a low memory mode for delta debugging. Instead of putting intermediates as inputs, we instead load them in the middle of the graph in question. If, through DCE, the load_tensor ends up floating to the top of the graph, we can input-ify it. We now no longer save all intermediates in memory, but instead save them to disk. I used this to successfully minify the repro that helped us solve https://github.com/pytorch/pytorch/pull/100332
The testing is not very good. I can try to add more robust testing but it will involve a more involved refactor to FX minifier. Let me know if that's what you want.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100546
Approved by: https://github.com/anijain2305, https://github.com/voznesenskym
This adds a new operator debugprims::load_storage which does the unusual thing of loading a tensor from disk (via ContentStoreReader). This will be used in a later PR to implement delta debugging in the minifier, even when the repro is too big to fit into memory. The way it works is that you specify a name of the tensor you want to load, as well as enough metadata to reconstruct the tensor, if the store isn't available. If there is an active content store, we read and return the tensor from that store; otherwise we use `rand_strided` to create it.
I needed some infra improvements to do this:
* `custom_op` now supports factory functions. Factory functions have to be registered specially via `impl_factory`
* I modified `clone_input` to also support dtype conversion, which I use to change the dtype of a loaded tensor if necessary.
* ContentStore needs to work with a device argument, so we torch.load directly to the correct device. This is for fake tensor support.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100519
Approved by: https://github.com/zou3519, https://github.com/anijain2305
Summary: This commit makes two improvements to the existing
test for Conv + BN fusion in `prepare_qat_pt2e`:
(1) Test `per_tensor_symmetric` in addition to `per_channel_symmetric`
(2) Initialize BN stats the same way in both flows. This is
necessary to get the `per_tensor_symmetric` case to pass.
Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_prepare_qat_conv_bn_numerics
Reviewers: jerryzh168, kimishpatel
Differential Revision: [D45512851](https://our.internmc.facebook.com/intern/diff/D45512851)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100271
Approved by: https://github.com/jerryzh168
Opening this so I can discuss with @albanD
I built a proof of concept of an in place API for an nn.Module that allows us to save and load a torch.compiled model with no issues https://github.com/msaroufim/mlsys-experiments/blob/main/save-compiled-model.py
So users can run `model.compile()` and then run `torch.save(model, "model.pt")` and `torch.load("model.pt")` with no issues, unlike the rather strange current suggestion we give to users, which is `opt_mod = torch.compile(mod); torch.save(mod, "model.pt")`
Right now I'm trying to extend this to work for nn.modules more generally
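A minimal sketch of the intended workflow (the model class here is hypothetical, for illustration only):
```python
import torch

class TinyModel(torch.nn.Module):  # hypothetical model used for illustration
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 4)

    def forward(self, x):
        return torch.relu(self.linear(x))

model = TinyModel()
model.compile()                    # in-place compile, instead of opt_mod = torch.compile(mod)
torch.save(model, "model.pt")      # saving the original module keeps it loadable
loaded = torch.load("model.pt")
print(loaded(torch.randn(2, 4)))
```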
TODO: Failing tests
* [x] torch.jit.load -> the issue was because of aliasing `__call__` to `_call_impl`; _call_impl used to be skipped but now it no longer is, so I expanded the skip check. I added an explicit `torch.jit.load()` test now, which @davidberard98 suggested
* [x] functorch seems to be a flake - ran locally and it worked `pytest functorch/test_eager_transforms.py`
* [x] a test infra flake - `test_testing.py::TestImports::test_no_mutate_global_logging_on_import_path_functorch`
* [x] It seems like I broke inlining in dynamo though `python -m pytest test/dynamo/test_dynamic_shapes.py -k test_issue175` chatting with Voz about it but still not entirely sure how to fix - found a workaround after chatting with @yanboliang
* [x] `pytest test/dynamo/test_modules.py`, `test/dynamo/test_dynamic_shapes`, and `test/dynamo/test_misc.py` seemed to be failing in CI, but trying them out locally they all pass with 0 failures
* [x] `pytest test/profiler/test_profiler_tree.py ` these tests have ProfilerTrees explicitly printed and will now break if __call__ is not in tree - ran with `EXPECT_ACCEPT=1`
* [x] `pytest test/test_torch.py::TestTorch::test_typed_storage_deprecation_warning` a flake, ran this locally and it works fine
* [x] I reverted my changes to `_dynamo/nn_module.py` since it looks like @wconstab is now directly handling `_call_impl` there but this is triggering an infinite inlining which is crashing
* [x] Tried out to instead override `__call__`, python doesnt like this though https://github.com/pytorch/pytorch/pull/97565#issuecomment-1524570439
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97565
Approved by: https://github.com/aaronenyeshi, https://github.com/albanD, https://github.com/voznesenskym
Add helpful context message to `NotImplementedError`'s thrown by Dataset and IterableDataset, reminding users that they must implement `__getitem__`/`__iter__` in subclasses. Currently, users are presented with a bare `NotImplementedError` without describing the remedy.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100667
Approved by: https://github.com/NivekT
…fused_attention
This allows all the tests in test_fused_attention to succeed when run together; otherwise, replacements are registered without the proper config set, and thus some tests fail and only succeed on rerun. This is also confusing when running the full file locally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100506
Approved by: https://github.com/drisspg
Per the discussion with @malfet , there is no need to run Windows binary build for every PR. We will keep it running in trunk (on push) though just in case.
This also moves the workflow back from unstable after the symlink copy fix in 860d444515
Another data point to back this up is the high correlation between the Windows binary debug and release builds vs. the Windows CPU CI job. The numbers are:
* `libtorch-cpu-shared-with-deps-debug` and `win-vs2019-cpu-py3` has 0.95 correlation
* `libtorch-cpu-shared-with-deps-release` and `win-vs2019-cpu-py3` has the same 0.95 correlation
The rest is noise, eh?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100638
Approved by: https://github.com/atalman
Fixes #ISSUE_NUMBER
Add serialization logic for backend metadata to tensor serialization, implemented through custom registration functions.
In #97429, the backendMeta structure was added to TensorImpl, and we think this part of the information may also need to be serialized for custom backends.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99808
Approved by: https://github.com/ezyang
- Enable event and interval-based os signpost tracing via env-var 'PYTORCH_MPS_TRACE_SIGNPOSTS' (python bindings sent in separate PR).
- Enable logging of MPS graphs, native kernels, and copies and their GPU times via env-var `PYTORCH_MPS_LOG_PROFILE_INFO`.
- Enable dumping the table of kernel profiling results sorted based on Mean GPU time when the process ends (SIGINT also handled).
- Fix a bug in MPSAllocator where the Allocator completionHandlers were called after MPSAllocator instance was destroyed.
- Added option to use Schedule Handlers to begin signpost intervals.
- Refer to comments in `MPSProfiler.h` to learn how to set env-vars for logging and signpost tracing. Proper documentation will be sent in a separate PR later.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100635
Approved by: https://github.com/kulinseth
Opening this so I can discuss with @albanD
I built a proof of concept of an in place API for an nn.Module that allows us to save and load a torch.compiled model with no issues https://github.com/msaroufim/mlsys-experiments/blob/main/save-compiled-model.py
So users can run `model.compile()` and then run `torch.save(model, "model.pt")` and `torch.load("model.pt")` with no issues, unlike the rather strange current suggestion we give to users, which is `opt_mod = torch.compile(mod); torch.save(mod, "model.pt")`
Right now I'm trying to extend this to work for nn.modules more generally
TODO: Failing tests
* [x] torch.jit.load -> the issue was because of aliasing `__call__` to `_call_impl`; _call_impl used to be skipped but now it no longer is, so I expanded the skip check. I added an explicit `torch.jit.load()` test now, which @davidberard98 suggested
* [x] functorch seems to be a flake - ran locally and it worked `pytest functorch/test_eager_transforms.py`
* [x] a test infra flake - `test_testing.py::TestImports::test_no_mutate_global_logging_on_import_path_functorch`
* [x] It seems like I broke inlining in dynamo though `python -m pytest test/dynamo/test_dynamic_shapes.py -k test_issue175` chatting with Voz about it but still not entirely sure how to fix - found a workaround after chatting with @yanboliang
* [x] `pytest test/dynamo/test_modules.py`, `test/dynamo/test_dynamic_shapes`, and `test/dynamo/test_misc.py` seemed to be failing in CI, but trying them out locally they all pass with 0 failures
* [x] `pytest test/profiler/test_profiler_tree.py ` these tests have ProfilerTrees explicitly printed and will now break if __call__ is not in tree - ran with `EXPECT_ACCEPT=1`
* [x] `pytest test/test_torch.py::TestTorch::test_typed_storage_deprecation_warning` a flake, ran this locally and it works fine
* [x] I reverted my changes to `_dynamo/nn_module.py` since it looks like @wconstab is now directly handling `_call_impl` there but this is triggering an infinite inlining which is crashing
* [x] Tried out to instead override `__call__`, python doesnt like this though https://github.com/pytorch/pytorch/pull/97565#issuecomment-1524570439
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97565
Approved by: https://github.com/aaronenyeshi, https://github.com/albanD
TORCH_INTERNAL_ASSERT_DEBUG_ONLY is not enabled in non-debug builds, but for 1-dimensional Tensors the check is cheap enough, and not catching this can slow down development a lot.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100596
Approved by: https://github.com/drisspg
Description:
- Fixed a bug with memory format issue:
When the input is a channels-last 4d tensor that was produced as follows
```
t = torch.ones(1, 3, 32, 32).contiguous(memory_format=torch.channels_last)
t = t[0]
t = t[None, ...]
```
upsampling will produce an output with channels-first memory format, but our AVX code does not take that into account.
Here is a repro code to show that nightly is broken for this particular case:
```python
import torch
torch.manual_seed(0)
input = torch.randint(0, 256, size=(1, 3, 256, 256), dtype=torch.uint8).contiguous(memory_format=torch.channels_last)
input = input[0]
input = input[None, ...]
assert input.is_contiguous(memory_format=torch.channels_last)
output = torch.nn.functional.interpolate(input, (224, 224), mode="bilinear", antialias=True)
expected = torch.nn.functional.interpolate(input.float(), (224, 224), mode="bilinear", antialias=True)
assert output.is_contiguous()
assert expected.is_contiguous()
torch.testing.assert_close(expected, output.float(), atol=1, rtol=1)
# >
# Traceback (most recent call last):
# File "<stdin>", line 1, in <module>
# File "/pytorch/torch/testing/_comparison.py", line 1511, in assert_close
# raise error_metas[0].to_error(msg)
# AssertionError: Tensor-likes are not close!
#
# Mismatched elements: 14120 / 150528 (9.4%)
# Greatest absolute difference: 214.6112518310547 at index (0, 1, 152, 13) (up to 1 allowed)
# Greatest relative difference: 17.005144119262695 at index (0, 2, 26, 2) (up to 1 allowed)
```
- Also renamed needs_unpacking to skip_unpacking
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100258
Approved by: https://github.com/NicolasHug
Fixes #99879
This adds `minimum_with_index` helper functions to compute the minimum
value and index simultaneously, with a preference for the smaller
index which is required to match eager in case of duplicates.
I also remove the mask-and-sum hack with a `tl.reduce` using
the previously mentioned helper. This additionally fixes the indices
being added together in the case of duplicates.
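For reference, this is the eager tie-breaking behavior the helper now matches; a minimal sketch:
```python
import torch

x = torch.tensor([3.0, 1.0, 1.0, 5.0])
val, idx = torch.min(x, dim=0)
print(val.item(), idx.item())  # expected: 1.0 1, the smaller index among the duplicate minima
```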
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100573
Approved by: https://github.com/ngimel
This adds helpers that replace Triton's `minimum`, `maximum`, `min` and
`max` with the correct NaN propagation. I also removed
`ops.int_minimum` in favor of `ops.minimum` because we can just omit
the nan-checks by checking the dtype.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100572
Approved by: https://github.com/ngimel
This PR:
- Adds `floordiv` and `truncdiv` as they were missing
- Maps `div` to its correct definition (it was being mapped to `floordiv`)
- Simplifies the bounds of `floordiv`
- Fixes some issues with the returned types of `floor` and `ceil`
- Adds tests for the previous point
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100547
Approved by: https://github.com/ezyang
Stable Diffusion has a pattern like this:
```
def forward(self, hidden_states, encoder_hidden_states=None, attention_mask=None, **cross_attention_kwargs):
    # The `Attention` class can call different attention processors / attention functions
    # here we simply pass along all tensors to the selected processor class
    # For standard processors that are defined here, `**cross_attention_kwargs` is empty
    return self.processor(
        self,
        hidden_states,
        encoder_hidden_states=encoder_hidden_states,
        attention_mask=attention_mask,
        **cross_attention_kwargs,
    )
```
Here, processor is something like `AttnProcessor2_0`, which is callable but not an nn.Module.
This allows for a significant speedup in stable diffusion.
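A minimal sketch of the pattern now handled (the processor class below is hypothetical, standing in for `AttnProcessor2_0`):
```python
import torch

class SoftmaxProcessor:  # callable, but not an nn.Module
    def __call__(self, attn, hidden_states):
        return torch.nn.functional.softmax(hidden_states, dim=-1)

class Attention(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.processor = SoftmaxProcessor()

    def forward(self, hidden_states):
        return self.processor(self, hidden_states)

compiled = torch.compile(Attention())
print(compiled(torch.randn(2, 8)))
```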
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100444
Approved by: https://github.com/anijain2305
The changes:
* Add config knob `same_two_models_use_fp64` for toggling whether or not to use fp64
* Add a test showing that RMSE is superior to atol/rtol
* Add a `--strict-accuracy` option, which allows testing against integral/boolean accuracy. By default, only regular accuracy is checked now. There's a test which exercises this; it's a little delicate, but I had trouble thinking of a good test otherwise.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100447
Approved by: https://github.com/voznesenskym
Summary: with the new c10d API, we don't need all ranks to call new_group. Integrate with the new API, so that every rank just call new_group 3 times, with a local barrier with the members within the group.
Reviewed By: xunnanxu, eeggl
Differential Revision: D45315615
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100518
Approved by: https://github.com/kumpera
This fixes a few reference counting bugs in eval_frame.c, simplifies a few functions a bit, and adds a few missing error handling code paths. Probably the only important reference counting bug is that `call_callback` previously leaked `THPPyInterpreterFrame` in Python 3.11+.
Summary below:
- eval_frame_callback_get shouldn't incref Py_None
- Don't leak THPPyInterpreterFrame in Python 3.11+
- set_profiler_hooks would decref profiler_start_hook and profiler_end_hook too many times if called with None as an argument (but we never actually used that code path).
- Simplify some argument parsing
- Only create guard_profiler_name_str once
- Add a few missing error checks
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100496
Approved by: https://github.com/albanD
Fixes #ISSUE_NUMBER
1. Add checkpoint support for custom devices.
2. Add a device argument. I wanted to add a device="cuda" parameter to the `forward` func of `CheckpointFunction` so that the device type can be specified when using it, but the `apply` func of `torch.autograd.Function` does not support `kwargs`, so I added a variable named `_device` (see the sketch below).
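A minimal sketch of why the extra variable is needed: `torch.autograd.Function.apply` only takes positional arguments, so extra settings must be passed positionally (this is an illustrative example, not the actual `CheckpointFunction` code):
```python
import torch

class Scale(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, device):
        # `device` must be passed positionally; Scale.apply(x, device="cpu") would fail
        ctx.device = device
        return x * 2

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out * 2, None

x = torch.randn(3, requires_grad=True)
y = Scale.apply(x, torch.device("cpu"))
```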
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99626
Approved by: https://github.com/soulitzer
On Windows, both '/' and '\\' can be used as a path separator, so `StripBasename` should handle them as path separators.
`StripBasename` is used in the `is_enabled` function in `torch\csrc\jit\jit_log.cpp`
Therefore, without this pull request, is_enabled does not work properly on Windows.
For more details, please refer to the issue #98145.
Fixes #98145
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98146
Approved by: https://github.com/ezyang
Summary
- Previously this was required by and entangled with `tracing_mode=symbolic` for `dynamic` tracing.
That is resolved by #99555 and its follow ups.
- Later decomposition pass will do graph lowering, so this step is duplicated.
- Updated `Functionalization` to workaround https://github.com/pytorch/pytorch/issues/99774#issuecomment-1527949391
Todo
- Training vs eval in dynamo_export
So we are effectively exporting all models in training mode by
default. But for the sake of this export we are only interested in eval mode.
The question is, should we call `model.eval()` in `dynamo_export`?
Tests with models containing batch norm fail 'functionalization' in training mode.
We are explicitly calling `model.eval()` for these models for now.
- Merge decomp and functionalize pass. Both calls into `make_fx`.
Merging potentially increases performance. However it is unclear
if it will result in different behavior.
Fixes#99662. (For the functionalization issue. Still need missing op support.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99667
Approved by: https://github.com/titaiwangms
Fixes #99564
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at c21d056</samp>
This pull request adds input validation and error handling tests for the `dot` and `vdot` operations in the `mps` namespace, using a new helper function and a new test function. This enhances the MPS backend and the testing framework for these operations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100099
Approved by: https://github.com/albanD, https://github.com/malfet
Added helper functions to match nodes in the graph that are decomposed from their source (leaf modules, or functional ops), as a result of dynamo tracing.
`get_source_partitions(graph: torch.fx.Graph, wanted_sources: List[Any]) -> Dict[Any, SourcePartition]`
Args:
* graph: The graph we want to partition
* wanted_sources: List of sources of nodes that were decomposed from this source. This can be a function (ex. torch.nn.functional.linear) or a leaf module type (ex. torch.nn.Linear)
Returns:
* Dictionary mapping sources (ex. torch.nn.modules.linear.Linear) to a list of SourcePartitions that correspond to the list of nodes that were flattened from a module of that type.
```
@dataclass
class SourcePartition():
    # Nodes in a particular partition
    nodes: List[Node]

    # Module type
    module_type: Type

    # Nodes in the graph that are needed as inputs to the partition
    input_nodes: List[Node] = field(default_factory=list)

    # Nodes in the partition that are being used by nodes outside of the partition
    output_nodes: List[Node] = field(default_factory=list)

    # Parameters that are being used
    params: List[str] = field(default_factory=list)
```
Example:
Original:
```
x -> linear -> linear -> relu -> linear
```
Traced graph:
```
.graph():
%arg0 : [#users=1] = placeholder[target=arg0]
%_param_constant0 : [#users=1] = get_attr[target=_param_constant0]
%t_default : [#users=1] = call_function[target=torch.ops.aten.t.default](args = (%_param_constant0,), kwargs = {})
%_param_constant1 : [#users=1] = get_attr[target=_param_constant1]
%addmm_default : [#users=1] = call_function[target=torch.ops.aten.addmm.default](args = (%_param_constant1, %arg0, %t_default), kwargs = {})
%_param_constant0_1 : [#users=1] = get_attr[target=_param_constant0]
%t_default_1 : [#users=1] = call_function[target=torch.ops.aten.t.default](args = (%_param_constant0_1,), kwargs = {})
%_param_constant1_1 : [#users=1] = get_attr[target=_param_constant1]
%addmm_default_1 : [#users=1] = call_function[target=torch.ops.aten.addmm.default](args = (%_param_constant1_1, %addmm_default, %t_default_1), kwargs = {})
%relu_default : [#users=1] = call_function[target=torch.ops.aten.relu.default](args = (%addmm_default_1,), kwargs = {})
%_param_constant2 : [#users=1] = get_attr[target=_param_constant2]
%t_default_2 : [#users=1] = call_function[target=torch.ops.aten.t.default](args = (%_param_constant2,), kwargs = {})
%_param_constant3 : [#users=1] = get_attr[target=_param_constant3]
%addmm_default_2 : [#users=1] = call_function[target=torch.ops.aten.addmm.default](args = (%_param_constant3, %relu_default, %t_default_2), kwargs = {})
return [addmm_default_2]
```
Result of `get_module_partitions`:
```
{<class 'torch.nn.modules.linear.Linear'>: [
ModulePartition(nodes=[_param_constant0, t_default, _param_constant1, addmm_default], module_type=<class 'torch.nn.modules.linear.Linear'>, input_nodes=[arg0], output_nodes=[addmm_default], params=["_param_constant0", "_param_constant1"]),
ModulePartition(nodes=[_param_constant0_1, t_default_1, _param_constant1_1, addmm_default_1], module_type=<class 'torch.nn.modules.linear.Linear'>, input_nodes=[addmm_default], output_nodes=[addmm_default_1], params=["_param_constant0_1", "_param_constant1_1"]),
ModulePartition(nodes=[_param_constant2, t_default_2, _param_constant3, addmm_default_2], module_type=<class 'torch.nn.modules.linear.Linear'>, input_nodes=[relu_default], output_nodes=[addmm_default_2], params=["_param_constant2", "_param_constant3"])],
<class 'torch.nn.modules.activation.ReLU'>: [
ModulePartition(nodes=[relu_default], module_type=<class 'torch.nn.modules.activation.ReLU'>, input_nodes=[addmm_default_1], output_nodes=[relu_default], params=[])]}
```
Also added helper function to check if two module partitions are connected:
`check_subgraphs_connected(subgraph1: SourcePartition, subgraph2: SourcePartition) -> bool`
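A minimal usage sketch, assuming the helpers live under `torch.fx.passes.utils.source_matcher_utils` and reusing the `traced` graph from the example above:
```python
import torch
from torch.fx.passes.utils.source_matcher_utils import (
    check_subgraphs_connected,
    get_source_partitions,
)

partitions = get_source_partitions(traced.graph, [torch.nn.Linear, torch.nn.ReLU])
linear_parts = partitions[torch.nn.Linear]
relu_parts = partitions[torch.nn.ReLU]
# whether the second linear partition and the relu partition are connected
print(check_subgraphs_connected(linear_parts[1], relu_parts[0]))
```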
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98628
Approved by: https://github.com/cccclai
Original PR #99988
The problem was that we added `wrap` to torch._ops which actually puts
it on `torch.ops.wrap` which is a namespace that can be open-registered
to. The fix is that we now shove `wrap` into a new file
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100544
Approved by: https://github.com/voznesenskym
Now that we have updated all internal callsites, per https://fb.workplace.com/groups/pytorch.oss.dev/permalink/1635183750239493/ we should raise a warning when use_reentrant is not explicitly passed for 2.1
Deprecation note:
- Not passing in use_reentrant explicitly is now deprecated and will raise a warning. In the future, the default value of use_reentrant will be False. To preserve the existing behavior you can pass in use_reentrant=True. It is recommended that you use use_reentrant=False.
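A minimal sketch of the recommended call:
```python
import torch
from torch.utils.checkpoint import checkpoint

def block(x):
    return torch.relu(x) @ x

x = torch.randn(8, 8, requires_grad=True)
# pass use_reentrant explicitly to avoid the new deprecation warning
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```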
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100551
Approved by: https://github.com/Skylion007
pytest rewrites Python assert statements in unit tests to provide more detailed error messages. Unfortunately, this breaks some dynamo tests. Disable AST rewriting in test_export.py so that "pytest test/dynamo/test_export.py" passes.
Fixes #93449
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100484
Approved by: https://github.com/tugsbayasgalan
Fixes comment error in TensorIterator.cpp
I believe there is an error in the comment, based on the following code snippet
```c++
if (shape0 * stride[dim0] != stride[dim1]) {
  return false;
}
```
I have corrected the comment accordingly. Please let me know if any further action is required.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100227
Approved by: https://github.com/kit1980
Description:
- As suggested by Nikita, created `torch.backends.cpu` submodule and exposed `get_cpu_capability`.
- In the torchvision Resize method we want to know the current CPU capability in order to pick the appropriate codepath depending on CPU capabilities.
The newly coded vectorized resize of uint8 images on AVX2-supported CPUs is now faster than the older way (uint8->float->resize->uint8). However, on non-AVX hardware (e.g. Mac M1) certain configs are slower using native uint8.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100164
Approved by: https://github.com/albanD, https://github.com/malfet
Summary:
This diff is reverting D45387167
D45387167: Basic dynamo support for traceable collectives (#94440) by wconstab has been identified to be causing the following test or build failures (internal)
If you believe this diff has been generated in error you may Commandeer and Abandon it.
Test Plan: NA
Reviewed By: s4ayub
Differential Revision: D45448312
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100424
Approved by: https://github.com/rohan-varma, https://github.com/kumpera
This is reopening of the PR https://github.com/pytorch/pytorch/pull/100377
# About this PR
Due to increased pressure over our windows runners, and the elevated cost of instantiating and bringing down those instances, we want to migrate instances from ephemeral to not ephemeral.
Possible impacts are related to breakages in or misbehavior of CI jobs that put the runners in a bad state. Other possible impacts are related to exhaustion of resources, especially disk space, but memory might be a contender, as CI trash piles up on those instances.
As a somewhat middle of the road approach to this, currently nonephemeral instances are stochastically rotated as older instances get higher priority to be terminated when demand is lower.
Instances definition can be found here: https://github.com/pytorch/test-infra/pull/4072
This is a first in a multi-step approach where we will migrate away from all ephemeral windows instances and follow the lead of the `windows.g5.4xlarge.nvidia.gpu` in order to help reduce queue times for those instances. The phased approach follows:
* migrate `windows.4xlarge` to `windows.4xlarge.nonephemeral` instances under `pytorch/pytorch`
* migrate `windows.8xlarge.nvidia.gpu` to `windows.8xlarge.nvidia.gpu.nonephemeral` instances under `pytorch/pytorch`
* submit PRs to all repositories under `pytorch/` organization to migrate `windows.4xlarge` to `windows.4xlarge.nonephemeral`
* submit PRs to all repositories under `pytorch/` organization to migrate `windows.8xlarge.nvidia.gpu` to `windows.8xlarge.nvidia.gpu.nonephemeral`
* terminate the existence of `windows.4xlarge` and `windows.8xlarge.nvidia.gpu`
* evaluate and start the work related to the adoption of `windows.g5.4xlarge.nvidia.gpu` to replace `windows.8xlarge.nvidia.gpu.nonephemeral` in other repositories and use cases (proposed by @huydhn)
The reasoning for this phased approach is to reduce the scope of possible contenders to investigate in case of misbehave of particular CI jobs.
# Copilot Summary
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 579d87a</samp>
This pull request migrates some windows workflows to use `nonephemeral` runners for better performance and reliability. It also adds support for new Python and CUDA versions for some binary builds. It affects the following files: `.github/templates/windows_binary_build_workflow.yml.j2`, `.github/workflows/generated-windows-binary-*.yml`, `.github/workflows/pull.yml`, `.github/actionlint.yaml`, `.github/workflows/_win-build.yml`, `.github/workflows/periodic.yml`, and `.github/workflows/trunk.yml`.
# Copilot Poem
<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 579d87a</samp>
> _We're breaking free from the ephemeral chains_
> _We're running on the nonephemeral lanes_
> _We're building faster, testing stronger, supporting newer_
> _We're the non-ephemeral runners of fire_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100377
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/atalman
(cherry picked from commit 7caac545b1d8e5de797c9593981c9578685dba81)
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100548
Approved by: https://github.com/jeanschmidt, https://github.com/janeyx99
# Summary
Previously we disallowed discontiguous NTs from being passed into empty_like. This was done out of an abundance of caution. However, it should be safe to create an empty NT for discontiguous NTs: empty_like does account for offsets, strides, and sizes in constructing the result, and therefore this should be safe.
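A minimal sketch (the contiguous case is shown; the point of the change is that non-contiguous NTs now take the same path):
```python
import torch

nt = torch.nested.nested_tensor([torch.randn(2, 3), torch.randn(4, 3)])
# empty_like allocates a new NT with matching nested sizes/strides/offsets
out = torch.empty_like(nt)
```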
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98383
Approved by: https://github.com/cpuhrsch
This PR enables sum tests for sparse sample inputs. Previously, the tests existed but were never run because the sum OpInfo instance was created without specifying `supports_sparse_*=True`. To avoid such mistakes in the future, the following PR https://github.com/pytorch/pytorch/pull/100392 enables the `supports_sparse_*` flags automatically when OpInfo creation specifies `sample_inputs_sparse_*_func`.
In addition, the PR applies several fixes to sum tests for sparse sample inputs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100391
Approved by: https://github.com/cpuhrsch
DTensor was reusing `einop_rule` to propagate sharding for torch.cat.
However, einsum only supports up to 52 subscripts (i.e., input tensors).
We have encountered use cases where one cat operator has more than 60
input tensors. Therefore, this commit reimplements sharding prop
rule for cat without using einsum.
Differential Revision: [D45435232](https://our.internmc.facebook.com/intern/diff/D45435232)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100251
Approved by: https://github.com/wanchaol
Follow up after https://github.com/pytorch/pytorch/pull/100436 to disable download.pytorch.org access over ipv6 access problems.
Why not copy `/etc/hosts` from the host to the container? Because it would break container IP resolution in distributed tests, which rely on `socket.gethostbyname(socket.gethostname())` to work.
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 756d0b1</samp>
Propagate `download.pytorch.org` IP address to docker containers in `test-pytorch-binary` action and workflow. This fixes DNS issues when downloading PyTorch binaries inside the containers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100475
Approved by: https://github.com/huydhn
This is reopening of the PR [100091](https://github.com/pytorch/pytorch/pull/100091)
# About this PR
Due to increased pressure over our windows runners, and the elevated cost of instantiating and bringing down those instances, we want to migrate instances from ephemeral to not ephemeral.
Possible impacts are related to breakages in or misbehavior of CI jobs that put the runners in a bad state. Other possible impacts are related to exhaustion of resources, especially disk space, but memory might be a contender, as CI trash piles up on those instances.
As a somewhat middle of the road approach to this, currently nonephemeral instances are stochastically rotated as older instances get higher priority to be terminated when demand is lower.
Instances definition can be found here: https://github.com/pytorch/test-infra/pull/4072
This is the first step in a multi-step approach where we will migrate away from all ephemeral Windows instances and follow the lead of `windows.g5.4xlarge.nvidia.gpu` in order to help reduce queue times for those instances. The phased approach follows:
* migrate `windows.4xlarge` to `windows.4xlarge.nonephemeral` instances under `pytorch/pytorch`
* migrate `windows.8xlarge.nvidia.gpu` to `windows.8xlarge.nvidia.gpu.nonephemeral` instances under `pytorch/pytorch`
* submit PRs to all repositories under `pytorch/` organization to migrate `windows.4xlarge` to `windows.4xlarge.nonephemeral`
* submit PRs to all repositories under `pytorch/` organization to migrate `windows.8xlarge.nvidia.gpu` to `windows.8xlarge.nvidia.gpu.nonephemeral`
* terminate the existence of `windows.4xlarge` and `windows.8xlarge.nvidia.gpu`
* evaluate and start the work related to the adoption of `windows.g5.4xlarge.nvidia.gpu` to replace `windows.8xlarge.nvidia.gpu.nonephemeral` in other repositories and use cases (proposed by @huydhn)
The reasoning for this phased approach is to reduce the scope of possible contenders to investigate in case particular CI jobs misbehave.
# Copilot Summary
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 579d87a</samp>
This pull request migrates some windows workflows to use `nonephemeral` runners for better performance and reliability. It also adds support for new Python and CUDA versions for some binary builds. It affects the following files: `.github/templates/windows_binary_build_workflow.yml.j2`, `.github/workflows/generated-windows-binary-*.yml`, `.github/workflows/pull.yml`, `.github/actionlint.yaml`, `.github/workflows/_win-build.yml`, `.github/workflows/periodic.yml`, and `.github/workflows/trunk.yml`.
# Copilot Poem
<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 579d87a</samp>
> _We're breaking free from the ephemeral chains_
> _We're running on the nonephemeral lanes_
> _We're building faster, testing stronger, supporting newer_
> _We're the non-ephemeral runners of fire_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100377
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/atalman
This diff locks in C++17 as the minimum standard with which PyTorch can be compiled.
This makes it possible to use all C++17 features in PyTorch.
This breaks backward compatibility in the sense that users with older compilers may find their compilers no longer are sufficient for the job.
Summary: #buildmore
Differential Revision: D44356879
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98209
Approved by: https://github.com/ezyang, https://github.com/malfet, https://github.com/PaliC
Bumps Windows CPU tests to trunk.yml (retaining the build in pull.yml); this also bumps the CUDA tests to periodic.yml (retaining the build in trunk.yml).
Hopefully this change will rein in windows spending on AWS since it is
currently our costliest platform (in terms of dollar amount / hours used)
Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100478
Approved by: https://github.com/kit1980, https://github.com/huydhn
Hi!
I've been fuzzing different pytorch modules, and found a crash inside one of them.
Specifically, I'm talking about a module for unpickling and a function called `Unpickler::readInstruction()`. Running this function with the provided crash file results in a crash, which occurs while calling `auto dict = stack_.at(dict_pos).toGenericDict();` [unpickler.cpp:561](0e94fbc0c8/torch/csrc/jit/serialization/unpickler.cpp (L561)). The crash occurs because the index `dict_pos` is out of bounds (which itself happens because the stack size is 0).
Besides this pull-request, there is another one related to unpickler hardening: https://github.com/pytorch/pytorch/pull/84343
All tests were performed on this pytorch version: [abc54f93145830b502400faa92bec86e05422fbd](abc54f9314)
### How to reproduce
1. To reproduce the crash, use provided docker: [Dockerfile](https://github.com/ispras/oss-sydr-fuzz/tree/master/projects/pytorch)
2. Build the container: `docker build -t oss-sydr-fuzz-pytorch-reproduce .`
3. Copy crash file to the current directory:
- [crash-042dff5e121580425d9d34d0f293918f3c9fbf1e.zip](https://github.com/pytorch/pytorch/files/10674361/crash-042dff5e121580425d9d34d0f293918f3c9fbf1e.zip)
4. Run the container: ``docker run --privileged --network host -v `pwd`:/homedir --rm -it oss-sydr-fuzz-pytorch-reproduce /bin/bash``
5. And execute the binary: `/message_deserialize_sydr /homedir/crash-042dff5e121580425d9d34d0f293918f3c9fbf1e`
After execution completes you will see this error message:
```txt
terminate called after throwing an instance of 'std::out_of_range'
what(): vector::_M_range_check: __n (which is 18446744073709551613) >= this->size() (which is 0)
```
And this stacktrace:
```asan
erminate called after throwing an instance of 'std::out_of_range'
what(): vector::_M_range_check: __n (which is 18446744073709551613) >= this->size() (which is 0)
==39== ERROR: libFuzzer: deadly signal
#0 0x5d0df1 in __sanitizer_print_stack_trace /llvm-project/compiler-rt/lib/asan/asan_stack.cpp:87:3
#1 0x545727 in fuzzer::PrintStackTrace() /llvm-project/compiler-rt/lib/fuzzer/FuzzerUtil.cpp:210:5
#2 0x52b933 in fuzzer::Fuzzer::CrashCallback() /llvm-project/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:233:3
#3 0x7f9118e0341f (/lib/x86_64-linux-gnu/libpthread.so.0+0x1441f)
#4 0x7f9118c2300a in raise (/lib/x86_64-linux-gnu/libc.so.6+0x4300a)
#5 0x7f9118c02858 in abort (/lib/x86_64-linux-gnu/libc.so.6+0x22858)
#6 0x7f9119040910 (/lib/x86_64-linux-gnu/libstdc++.so.6+0x9e910)
#7 0x7f911904c38b (/lib/x86_64-linux-gnu/libstdc++.so.6+0xaa38b)
#8 0x7f911904c3f6 in std::terminate() (/lib/x86_64-linux-gnu/libstdc++.so.6+0xaa3f6)
#9 0x7f911904c6a8 in __cxa_throw (/lib/x86_64-linux-gnu/libstdc++.so.6+0xaa6a8)
#10 0x7f91190433aa (/lib/x86_64-linux-gnu/libstdc++.so.6+0xa13aa)
#11 0x63acdf in std::vector<c10::IValue, std::allocator<c10::IValue> >::_M_range_check(unsigned long) const /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/stl_vector.h:1073:4
#12 0xce8f93e in std::vector<c10::IValue, std::allocator<c10::IValue> >::at(unsigned long) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/stl_vector.h:1094:2
#13 0xce8f93e in torch::jit::Unpickler::readInstruction() /pytorch_fuzz/torch/csrc/jit/serialization/unpickler.cpp:546:26
#14 0xce8d527 in torch::jit::Unpickler::run() /pytorch_fuzz/torch/csrc/jit/serialization/unpickler.cpp:235:27
#15 0xce8d1c2 in torch::jit::Unpickler::parse_ivalue() /pytorch_fuzz/torch/csrc/jit/serialization/unpickler.cpp:192:3
#16 0xcdf0792 in torch::jit::unpickle(std::function<unsigned long (char*, unsigned long)>, std::function<c10::StrongTypePtr (c10::QualifiedName const&)>, c10::ArrayRef<at::Tensor>, c10::Type::SingletonOrSharedTypePtr<c10::Type> (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)) /pytorch_fuzz/torch/csrc/jit/serialization/pickle.cpp:127:20
#17 0xcdf104d in torch::jit::unpickle(char const*, unsigned long, std::function<c10::StrongTypePtr (c10::QualifiedName const&)>, c10::ArrayRef<at::Tensor>, c10::Type::SingletonOrSharedTypePtr<c10::Type> (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)) /pytorch_fuzz/torch/csrc/jit/serialization/pickle.cpp:137:10
#18 0xe0532db in torch::distributed::rpc::ScriptRemoteCall::fromMessage(torch::distributed::rpc::Message const&) /pytorch_fuzz/torch/csrc/distributed/rpc/script_remote_call.cpp:74:16
#19 0xe0ffa10 in torch::distributed::rpc::deserializeRequest(torch::distributed::rpc::Message const&) /pytorch_fuzz/torch/csrc/distributed/rpc/utils.cpp:108:14
#20 0x602a41 in LLVMFuzzerTestOneInput /message_deserialize_fuzz.cc:192:27
#21 0x52ce61 in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) /llvm-project/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:611:15
#22 0x516d7c in fuzzer::RunOneTest(fuzzer::Fuzzer*, char const*, unsigned long) /llvm-project/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:324:6
#23 0x51cacb in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) /llvm-project/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:860:9
#24 0x546062 in main /llvm-project/compiler-rt/lib/fuzzer/FuzzerMain.cpp:20:10
#25 0x7f9118c04082 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24082)
#26 0x51169d in _start (/message_deserialize_fuzz+0x51169d)
NOTE: libFuzzer has rudimentary signal handlers.
Combine libFuzzer with AddressSanitizer or similar for better crash reports.
SUMMARY: libFuzzer: deadly signal
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94300
Approved by: https://github.com/malfet, https://github.com/apach301
Now that expandable_segments has been merged from OSS, we can enable it in the internal build. It still defaults to off, so this should not change any behavior in the allocator unless the flag is explicitly set.
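For reference, a minimal sketch of how the flag is typically enabled in OSS builds (assuming the standard allocator config string; the variable must be set before CUDA is first initialized):
```py
import os

# Assumption: PYTORCH_CUDA_ALLOC_CONF is read when the CUDA caching allocator
# initializes, so set it before the first CUDA allocation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # the allocator picks up the config on first CUDA use
```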
Differential Revision: D45249535
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100184
Follow-up after https://github.com/pytorch/pytorch/pull/100436 to disable download.pytorch.org access over IPv6 due to access problems.
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 55c9443</samp>
This pull request improves the network configuration of the test-pytorch-binary GitHub action and workflow by mounting the host's `/etc/hosts` file into the container. This enables the container to resolve hostname aliases consistently with the host machine.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100475
Approved by: https://github.com/huydhn
This PR introduces a `wrap(body_fn, *args)` higher-order operator.
The semantics of `wrap(body_fn, *args)` is to just run `body_fn(*args)`.
Underneath Dynamo, this PR makes it so that we rewrite calls to
`wrap(body_fn, *args)` with `wrap(new_fn, *new_args)` where `new_fn` has
no free variables. This PR does not update cond/map to use the new
mechanism yet (we do not support nn.Modules yet; that will come in the future).
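To make the semantics concrete, here is a minimal sketch; the `wrap` below is a plain-Python stand-in for the real HigherOrderOperator, for illustration only:
```py
import torch

def wrap(body_fn, *args):
    # Stand-in: eagerly, the real operator has exactly these semantics.
    return body_fn(*args)

def f(x, scale):
    # body_fn closes over `scale`; under Dynamo this call is rewritten to
    # wrap(new_fn, x, scale) where new_fn has no free variables.
    return wrap(lambda t: t.sin() * scale, x)

print(f(torch.randn(3), 2.0))
```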
The design we take is:
- OutputGraph represents the graph being built by Dynamo that may be
compiled and executed.
- OutputGraph owns a root SubgraphTracer, where it builds the FX graph.
- OutputGraph may own multiple nested SubgraphTracers.
- When we need to trace the body function of a HigherOrderOperator, we
construct a new SubgraphTracer to build the graph of the body function.
Mechanically, when Dynamo sees a new `wrap` HigherOrderOperator with a
body function, it:
- Creates a new SubgraphTracer via OutputGraph.new_subtracer
- Executes the body function
This captures the body function into the graph on the new
SubgraphTracer while modifying the state of the OutputGraph. For
example, the OutputGraph may receive new GraphArgs, new guards, and new
side effects.
If capture of the body function fails, then Dynamo graph breaks on the
HigherOrderOperator.
Test Plan:
- added test/dynamo/test_higher_order_ops.py
Future:
- We're not actually able to tell Dynamo to completely graph break on the
HigherOrderOperator. Instead, when we do graph break, Dynamo begins
introspecting `HigherOrderOperator.__call__`. It should probably not do
this.
- Ideally we would error out on new SideEffects. I don't know how to do
this yet.
- We don't support dealing with nn.Modules yet (e.g. calling nn.Modules
or accessing attributes of tracked nn.Modules from a body_fn). There's
an open question on what should actually happen here
- Ideally we would rewrite map/cond to use the new mechanism but we need
to fix the previous bullet point before we can get there.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99988
Approved by: https://github.com/voznesenskym, https://github.com/anijain2305
This PR splits OutputGraph into two classes:
- SubgraphTracer (handles FX-tracing)
- OutputGraph (handles Dynamo-specific output graph logic, like
tracking graph inputs, compiling the graph, and executing it).
The motivation behind this is in the next PR up in the stack.
TL;DR is: in order to do higher-order operators, we need nested
SubgraphTracer, one for each level of nesting of the higher-order
operators.
I'm happy to flatten the stack into a single PR, but this separation made
it easier for me to test. Lmk if you want the stack flattened.
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99987
Approved by: https://github.com/anijain2305, https://github.com/voznesenskym
Previously, minifier testing injected faults by injecting extra code
into the repro scripts, and then ensuring this code got propagated to
all subsequent subprocess calls. This was not only quite complicated,
but also induced a big slowdown on the minifier, because to inject the
faults, you had to import torch._inductor, which would cause the
compilation threads to immediately get initialized before you even got
to do anything else in the repro script.
This new approach fixes this problem by incorporating the fault
injection into "prod" code. Essentially, for inductor fault injection
we introduce some new config flags that let you "configure" Inductor to
be buggy; for Dynamo fault injection we just permanently keep the buggy
testing backends registered. This is MUCH simpler: we only have to
propagate the buggy config (which is something we're already doing),
and it saves the minifier scripts from having to immediately initialize
inductor on entry.
Also, I enable the test for Triton runtime errors, now that tl.assert_device is here.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100357
Approved by: https://github.com/voznesenskym
Previously, due to the use of the Python set data structure, the ordering of saved values (and how they would appear in the graph) was unstable and changed across runs, making it hard to debug downstream applications. Here we use a dict (with insertion-ordering semantics) to deduplicate values in a way that preserves ordering.
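A small standalone illustration of the change (plain Python, not the actual autograd code):
```py
saved = ["mul", "add", "mul", "relu"]

unstable = set(saved)                # deduplicates, but iteration order is unstable
stable = list(dict.fromkeys(saved))  # deduplicates and preserves insertion order
print(stable)                        # ['mul', 'add', 'relu'] on every run
```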
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100111
Approved by: https://github.com/Skylion007
Fixes #100314
In dependencies, we should track not only the immediately used buffer but also aliased buffers that point to it; otherwise we can reuse and overwrite the buffer while there are still pending uses.
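A small standalone illustration (not Inductor code) of why aliases matter for reuse decisions:
```py
import torch

buf = torch.zeros(4)
alias = buf.view(2, 2)   # an aliased buffer pointing at the same storage
buf.fill_(1.0)           # "reusing" buf overwrites the shared storage
print(alias)             # any pending use of alias now sees the overwritten data
```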
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100332
Approved by: https://github.com/jansel
Previously, when using `self.assertRaisesRegex` to test a raised exception and its regex, the regex wasn't actually compared because mps was not in `NATIVE_DEVICES`. This PR fixes that by enabling exception regex comparisons for the mps device.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100367
Approved by: https://github.com/albanD
During the regular merge process, the `GitHubPR` object does not have the `merging` label when it is created, and when the label is added the existing `GitHubPR` object is not updated either.
To fix the problem, call the REST API wrapper `gh_remove_label` directly. In the worst case, if the label has already been removed at this point, an error is printed to stderr, which is not rendered on HUD anyway.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100433
Approved by: https://github.com/PaliC, https://github.com/kit1980
During the regular merge process, `GitHubPR` and `GitHubRepo` objects are first created in main() and then re-created in `merge()` instead of being passed by reference, which results in making the same GraphQL requests to the repo twice.
<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at ee4e23e</samp>
> _Sing, O Muse, of the skillful coder who refactored_
> _The `merge` function, to accept a `GitHubPR` object,_
> _And thus reduced the calls to the divine API_
> _And the duplication of code, that source of errors._
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100434
Approved by: https://github.com/kit1980, https://github.com/PaliC, https://github.com/huydhn, https://github.com/ZainRizvi
Summary:
* `dynamo_export`, and everything within now access diagnostic context through a maintained local
variable, instead of global.
* Refactored `diagnose_call` decorator to require local diagnostic context, instead of accessing global.
* Modified `test_fx_to_onnx_*.py` tests to only log '*.sarif' logs when `verbose=True`.
* Temporarily removed diagnostics for `OnnxFunction`, as they don't have access to diagnostic context
anymore. These diagnostics will be the responsibility of `onnxscript`, and they will return once
diagnostics system is integrated there.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100219
Approved by: https://github.com/justinchuby
`ThreadFlowLocation`, a.k.a. 'step', cannot be fully visualized by the `SARIF vscode extension` today.
Discarding `diagnose_step` so that we don't end up creating diagnostics that record things there.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99944
Approved by: https://github.com/justinchuby
Summary
* Introduce `DiagnosticContext` to `torch.onnx.dynamo_export`.
* Remove `DiagnosticEngine` in preparations to update 'diagnostics' in `dynamo_export` to drop dependencies on global diagnostic context. No plans to update `torch.onnx.export` diagnostics.
Next steps
* Separate `torch.onnx.export` diagnostics and `torch.onnx.dynamo_export` diagnostics.
* Drop dependencies on global diagnostic context. https://github.com/pytorch/pytorch/pull/100219
* Replace 'print's with 'logger.log'.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99668
Approved by: https://github.com/justinchuby, https://github.com/abock
Following the example I did for ONNX in https://github.com/pytorch/pytorch/pull/96793, this caches the pretrained `mobilenet_v2` and `mobilenet_v3_large` models used by CI jobs. I think there might be an issue either with AWS or with the domain download.pytorch.org, as the connection to the latter has been failing a lot in the past few days.
Related flaky jobs:
* https://github.com/pytorch/pytorch/actions/runs/4835873487/jobs/8618836446
* https://github.com/pytorch/pytorch/actions/runs/4835783539/jobs/8618404639
* https://github.com/pytorch/pytorch/actions/runs/4835783539/jobs/8618404639
```
Downloading: "https://download.pytorch.org/models/mobilenet_v2-b0353104.pth" to /var/lib/jenkins/.cache/torch/hub/checkpoints/mobilenet_v2-b0353104.pth
Traceback (most recent call last):
File "/opt/conda/envs/py_3.8/lib/python3.8/urllib/request.py", line 1354, in do_open
h.request(req.get_method(), req.selector, req.data, headers,
File "/opt/conda/envs/py_3.8/lib/python3.8/http/client.py", line 1256, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/opt/conda/envs/py_3.8/lib/python3.8/http/client.py", line 1302, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/opt/conda/envs/py_3.8/lib/python3.8/http/client.py", line 1251, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/opt/conda/envs/py_3.8/lib/python3.8/http/client.py", line 1011, in _send_output
self.send(msg)
File "/opt/conda/envs/py_3.8/lib/python3.8/http/client.py", line 951, in send
self.connect()
File "/opt/conda/envs/py_3.8/lib/python3.8/http/client.py", line 1418, in connect
super().connect()
File "/opt/conda/envs/py_3.8/lib/python3.8/http/client.py", line 922, in connect
self.sock = self._create_connection(
File "/opt/conda/envs/py_3.8/lib/python3.8/socket.py", line 808, in create_connection
raise err
File "/opt/conda/envs/py_3.8/lib/python3.8/socket.py", line 796, in create_connection
sock.connect(sa)
OSError: [Errno 99] Cannot assign requested address
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100302
Approved by: https://github.com/ZainRizvi
Summary: Not sure how the train bool passed to batch_norm gets set, but it's not the is_training module-level flag. We get weird behavior for teams trying to do on-device training because of this.
Test Plan: ci
Differential Revision: D45335791
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100134
Approved by: https://github.com/larryliu0820
Fixes #99259 by drawing attention to the fact that input is optional: a variation of the method signature is added at the top of the file and the input arguments are modified.
Note that I'm not certain how to get the additional signature at the same level of indentation as the first one, but I think this change does a good job of highlighting that the argument is optional.
Would be happy to iterate on this if there are any issues.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99650
Approved by: https://github.com/mikaylagawarecki
When we run cudagraph trees we are not allowed to have permanent workspace allocations like in cublas because we might need to reclaim that memory for a previous cudagraph recording, and it is memory that is not accounted for in output weakrefs so it does not work with checkpointing. Previously, I would check that we didn't have any additional allocations through snapshotting. This was extremely slow so I had to turn it off.
This PR first does a quick check to see whether we are in an error state, and only if we are does it run the slow logic of creating a snapshot. It also turns on history recording so we get a stacktrace of where the bad allocation came from.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99985
Approved by: https://github.com/zdevito
This is an easy follow-up to the previous PR to (1) clarify that `view` is the original parameter's gradient and (2) that after `reshard()` the gradient is on CPU only if offloading parameters.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100359
Approved by: https://github.com/rohan-varma
This is a two-part PR; I can split it if you really want me to.
The first part is a refactor of the after-AOT repro/minifier scripts to come with a command-line interface. I maintain exact BC with the previous interface (so, e.g., you still get a repro.py and a run_minifier.py that do the same thing as before), but each of these scripts also takes command-line arguments now which you can use to customize what actually happens. Check `run_repro` for full documentation on the arguments.
The second part of this is an implementation of `analyze` subcommand on the new CLI for any repro.
<img width="1277" alt="image" src="https://user-images.githubusercontent.com/13564/235045677-8545aab7-5e83-4813-bbec-47783dc60122.png">
This facility is oriented towards accuracy debugging. It does several things:
1. It will run your model twice and check for nondeterminism in inductor/float64, *even* on intermediate inputs (our benchmarking nondeterminism test only checks for nondeterminism on the final output). This makes localizing which operator is nondeterministic easy.
2. It will run your compiled model side-by-side with eager and float64 variants, and then report when things diverge too far from RMSE delta from float64.
Importantly, it does all this without requiring every intermediate to be held in memory (which will cause an OOM on large repros, such as the one I tested this on.)
Some other minor improvements:
* MinifierTestBase now has an easy to comment out spot that you can use to retain the temporary directory; good for debugging
* We print "running minifier" and "running repro" in MinifierTestBase to make it easier to orient where logs are coming from
* same takes a `log_error` optional argument which you can use to reroute the error logs when things mismatch
* counters["inductor"]["intermediate_hooks"] tracks the number of intermediate hooks we've codegen'ed; good for populate the tqdm interface
* torch.fx.interpreter gets an official `boxed_run` interface which uses the boxed arguments calling convention and doesn't retain inputs unnecessarily long
* torch.utils._content_store gets compute_tensor_metadata/read_tensor_metadata helper functions for computing tensor information without serializing it
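A minimal sketch of the boxed calling convention mentioned above (hypothetical toy graph):
```py
import torch
from torch.fx import Interpreter, symbolic_trace

def f(x):
    return x.relu() + 1

gm = symbolic_trace(f)
args = [torch.randn(4)]
out = Interpreter(gm).boxed_run(args)  # the args list is consumed and cleared
print(out, args)                       # args == [] -> inputs aren't kept alive
```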
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100226
Approved by: https://github.com/bertmaher, https://github.com/bdhirsh, https://github.com/anijain2305
For the TIMM `mobilevit` dynamic path, there is a compiler issue when running:
```
python -m torch.backends.xeon.run_cpu --node_id 0 benchmarks/dynamo/timm_models.py --performance --float32 -dcpu -n2 --inductor --no-skip --dashboard --only mobilevit_s --inference --dynamic-shapes
```
which fails with:
```
/tmp/torchinductor_xiaobing/xy/cxyslqzcsxkco4ieph7t63kn5q74ka35ak75lwfon32nlalxmru5.cpp:29:130: error: invalid operands of types ‘long int’ and ‘double’ to binary ‘operator%’
29 | auto tmp0 = in_ptr0[static_cast<long>((((((-1L) + ks1) / 8L)*(((-1L) + ks1) / 8L))*((((2L*((i2 / 1L) % (std::ceil((1.0/2.0) + ((1.0/2.0)*(((-1L) + ks1)
```
There is a modulo of `long % double`; this PR converts the inputs to long before doing this operation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100230
Approved by: https://github.com/jansel
Hi!
I've been fuzzing different pytorch modules, and found a crash inside one of them.
Specifically, I'm talking about a module that processes `script_call` rpc requests and a function `ScriptCall::fromIValues(std::vector<at::IValue>& ivalues)`.
Running this test case causes a crash that occurs when `ivalues.back()` is called [script_call.cpp:90](abc54f9314/torch/csrc/distributed/rpc/script_call.cpp (L90)). The crash occurs because the vector `ivalues` is empty.
All tests were performed on this pytorch version: [abc54f93145830b502400faa92bec86e05422fbd](abc54f9314)
The provided patch checks if there are enough elements in the ivalues vector.
### How to reproduce
1. To reproduce the crash, use provided docker: [Dockerfile](https://github.com/ispras/oss-sydr-fuzz/tree/master/projects/pytorch)
2. Build the container: `docker build -t oss-sydr-fuzz-pytorch-reproduce .`
3. Copy crash file to the current directory:
- [crash-9f76d4e37a2391136a4ce07d47269db1e063e4b4.zip](https://github.com/pytorch/pytorch/files/10674059/crash-9f76d4e37a2391136a4ce07d47269db1e063e4b4.zip)
4. Run the container: ``docker run --privileged --network host -v `pwd`:/homedir --rm -it oss-sydr-fuzz-pytorch-reproduce /bin/bash``
5. And execute the binary: `/message_deserialize_fuzz /homedir/crash-9f76d4e37a2391136a4ce07d47269db1e063e4b4`
After execution completes you will see this stacktrace:
```asan
AddressSanitizer:DEADLYSIGNAL
=================================================================
==57==ERROR: AddressSanitizer: SEGV on unknown address (pc 0x0000008e7b19 bp 0x7ffd2fdded70 sp 0x7ffd2fddec40 T0)
==57==The signal is caused by a READ memory access.
==57==Hint: this fault was caused by a dereference of a high value address (see register values below). Disassemble the provided pc to learn which register was used.
#0 0x8e7b19 in c10::IValue::isString() const /pytorch_fuzz/aten/src/ATen/core/ivalue.h:639:27
#1 0x8e7b19 in c10::IValue::toStringRef[abi:cxx11]() const /pytorch_fuzz/aten/src/ATen/core/ivalue_inl.h:2179:3
#2 0xe04fb58 in torch::distributed::rpc::ScriptCall::fromIValues(std::vector<c10::IValue, std::allocator<c10::IValue> >&) /pytorch_fuzz/torch/csrc/distributed/rpc/script_call.cpp:90:53
#3 0xe0511f0 in torch::distributed::rpc::ScriptCall::fromMessage(torch::distributed::rpc::Message const&) /pytorch_fuzz/torch/csrc/distributed/rpc/script_call.cpp:133:10
#4 0xe0ff71e in torch::distributed::rpc::deserializeRequest(torch::distributed::rpc::Message const&) /pytorch_fuzz/torch/csrc/distributed/rpc/utils.cpp:102:14
#5 0x602a41 in LLVMFuzzerTestOneInput /message_deserialize_fuzz.cc:192:27
#6 0x52ce61 in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) /llvm-project/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:611:15
#7 0x516d7c in fuzzer::RunOneTest(fuzzer::Fuzzer*, char const*, unsigned long) /llvm-project/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:324:6
#8 0x51cacb in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) /llvm-project/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:860:9
#9 0x546062 in main /llvm-project/compiler-rt/lib/fuzzer/FuzzerMain.cpp:20:10
#10 0x7f41e42a8082 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24082)
#11 0x51169d in _start (/message_deserialize_fuzz+0x51169d)
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /pytorch_fuzz/aten/src/ATen/core/ivalue.h:639:27 in c10::IValue::isString() const
==57==ABORTING
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94297
Approved by: https://github.com/ezyang
* Adds an extra test_allgather_base in UccProcessGroupWithDispatchedCollectivesTests; the rest of the nccl and gloo tests there don't work on ucc
* Adds cpu tests for [op]_work_wait_gpu tests
* Added a single-tensor input test for allgather_basics; multi-tensor input still doesn't seem to be supported by ucc
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99654
Approved by: https://github.com/kwen2501
* change the hook so that the test still gets saved in --sc when it fails in test setup (this caused an off-by-one error due to setup being called before the logreport hook)
* allow reruns for all tests now that --sc is used
* increase number of reruns now that --sc is used
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100200
Approved by: https://github.com/huydhn
Previously, the mismatch report would not give the full details of the collective running on the mismatched rank; it would look something like:
```
Detected mismatch between collectives on ranks. Rank 26 is running collective: CollectiveFingerPrint(SequenceNumber=683057617, OpType=BROADCAST, TensorShape=[1], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cpu, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))), but Rank 1 is running collective: CollectiveFingerPrint(SequenceNumber=513876813OpType=BROADCAST).
```
i.e., Rank 1 is missing details such as the shape, type, etc.
This was due to the `num_tensors` field not being populated, which `operator<<` checks to determine whether to print additional information such as the tensor shape.
Adding this field gives a better error:
```
Detected mismatch between collectives on ranks. Rank 0 is running collective: CollectiveFingerPrint(SequenceNumber=1564312518, OpType=ALLREDUCE, TensorShape=[20, 10], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cpu, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))), but Rank 1 is running collective: CollectiveFingerPrint(SequenceNumber=1564312518, OpType=REDUCE, TensorShape=[20, 10], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cpu, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))).
```
Differential Revision: [D45372325](https://our.internmc.facebook.com/intern/diff/D45372325/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100213
Approved by: https://github.com/H-Huang
Without affecting the existing cpu/cuda logic, a separate interface is provided for custom backends, and users can choose whether to use the interface function, which provides 10 tensor types with custom-backend variations.
Therefore, users can use torch.set_default_tensor_type to set the default device tensor type, or use torch.xxx.dtypetensor to create a tensor. For example, torch.set_default_tensor_type(torch.foo.DoubleTensor) or torch.foo.DoubleTensor([]).
@albanD , please review my changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99521
Approved by: https://github.com/albanD
This is a mirror PR of D45339293
Summary:
These tests cause the following errors internally with unknown reason:
```
AttributeError: type object 'TestDistBackendWithSpawn' has no attribute 'test_ddp_hook_with_optimizer_parity_adam'
AttributeError: type object 'TestDistBackendWithSpawn' has no attribute 'test_ddp_hook_with_optimizer_parity_adamw'
AttributeError: type object 'TestDistBackendWithSpawn' has no attribute 'test_ddp_hook_with_optimizer_parity_sgd'
```
Commenting these tests out to unblock other PRs.
Test Plan: Sandcastle
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100215
Approved by: https://github.com/wz337, https://github.com/fduwjj
This PR changes the CustomOp API. There are now two ways to create a
CustomOp object.
Method 1: with no schema string. We will infer what the schema string is
from your type annotations
```py
@custom_op("customlib::foo")
def foo(x: Tensor) -> Tensor:
...
```
Method 2: with a schema string, if the inference doesn't work well.
```py
@custom_op("customlib::foo", "(Tensor x) -> Tensor")
def foo(x):
...
```
Some details:
- We support most combinations of {Tensor, Number, int, float, bool} and
{Optional[typ], Tuple[typ, ...]} as inputs. The combinations we support are mostly
from me reading native_functions.yaml.
- We support only Tensor or Tuple of Tensor of fixed size returns.
- A lot of this PR is input validation for both of the above two
methods. For example, when a user provides a manual schema string, then
their function must not have any type annotations and the number of args
and arg names must match the schema.
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100127
Approved by: https://github.com/ezyang
This PR makes a CustomOp live forever. The motivation for it living
forever is that:
1. It doesn't matter to a user if it lives forever or not
2. it is a higher-level abstraction over OpOverload, and OpOverload
assumes that OpOverload lives forever.
The only place where it matters that CustomOp lives forever is testing:
I don't want to generate random names for my CustomOp objects. To
resolve the testing problem, This PR adds a CustomOp._destroy() that
clears all the C++ state, including the OpOverloadPacket, that is
associated with the CustomOp object.
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100114
Approved by: https://github.com/ezyang
This PR fixes capturing static methods for FSDP-managed modules. Previously, if a static method was invoked using `self.<staticmethod>`, then Dynamo would pass `self` twice to the method, causing a graph break due to the method being "unsupported". This PR achieves this by checking for `staticmethod` and using `UserFunctionVariable` instead of `UserMethodVariable`, which handles the correct calling convention.
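A minimal sketch (hypothetical module, not the T5 code) of the calling pattern this PR fixes:
```py
import torch

class M(torch.nn.Module):
    @staticmethod
    def _scale(x):
        return x * 2

    def forward(self, x):
        # A staticmethod invoked via `self`: previously Dynamo passed `self`
        # as an extra argument here and graph-broke on the "unsupported" call.
        return self._scale(x)

m = torch.compile(M())
print(m(torch.randn(3)))
```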
This fixes FSDP + PT2 on HuggingFace's `T5ForConditionalGeneration`, which otherwise reports an error like the following based on the most recent trunk:
```
Output 0 of AsStridedBackward0 is a view of a view which was created in no_grad mode and is being modified inplace with grad mode enabled.
```
This is in reference to the `scores` tensor in `scores += position_bias_masked` ([code](a0ae2310ec/src/transformers/models/t5/modeling_t5.py (L559))).
I am not clear if this PR's fix is actually masking a different problem though. I wonder if there are edge cases with respect to Dynamo resuming execution and input mutations. Possibly, this PR only side steps the problem because there is no more recompilation at the static method `_relative_position_bucket()` ([code](a0ae2310ec/src/transformers/models/t5/modeling_t5.py (L443))).
In `UserDefinedObjectVariable.var_getattr()`, there is an existing branch:
e5291e633f/torch/_dynamo/variables/user_defined.py (L395-L398)
I am not clear on when this branch can be triggered since if `subobj` is a static method, it still takes the `FunctionTypes` branch:
e5291e633f/torch/_dynamo/variables/user_defined.py (L403-L404)
To preserve backward compatibility, the current version of this PR only modifies this `FunctionTypes` branch to differentiate between `staticmethod` and not `staticmethod`.
The PR that added this `FunctionTypes` branch is https://github.com/pytorch/pytorch/pull/92050/, and I checked that the added test `test_torch_distributions_functions()` only exercises the non-`staticmethod` case (since `Independent.log_prob` is not a `staticmethod`).
The last commit in `pytorch` that touched the `staticmethod` branch before https://github.com/pytorch/pytorch/pull/92050/ was the move from the `torchdynamo` repo into `pytorch`, so I cannot easily tell which test cases it corresponds to.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100117
Approved by: https://github.com/anijain2305
This PR:
- adds an abstract registration API for CustomOp (CustomOp.impl_abstract)
that is used for both FakeTensor and meta tensors
- deletes CustomOp.impl_meta
The user story behind this API is that it is the one-stop shop for
registering implementations for data-less Tensors, i.e. FakeTensor and
Meta tensor.
The abstract implementation provided by the user:
- gets registered as the FakeTensor implementation AND the meta formula
- can be written like a regular meta formula. If the user decides that
they need something more special (i.e. data-dependent output shape),
then they are able to query a current context object (FakeTensorImplCtx)
that has methods to construct new unbacked symints.
Caveats:
- we really need to make FakeTensor/FakeTensorMode public. Otherwise,
there isn't a way for the user to interactively test that their abstract
implementation is correct without running through large pieces of the
PT2 stack (make_fx or torch.compile).
- We do not memoize the symints produced by
ctx.create_unbacked_symint(). It is possible to do this in the
future, but it is difficult to do soundly and I am not convinced of
the utility outside of the nonzero() usecase mentioned in #95399
Public API:
- More docs will come when we actually expose this API to users by
putting it in a public namespace, unless you folks want it now.
- The APIs mentioned in `__all__` are the ones that are intended to be
public.
Test Plan:
- Updated existing custom_op_db operators
- Added new numpy_nonzero and numpy_nms operations that test operations
that have data-dependendent output shape.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99439
Approved by: https://github.com/ezyang
The problem:
- The new CustomOp API depends on torchgen.model
- torchgen.model imports `yaml`
- `yaml` is not a PyTorch runtime dependency
To unblock myself, because I'm not sure how long it'll take to
convince people yaml should be a PyTorch runtime dependency
(unless one of you wants to approve #100166), this PR removes the
yaml dependency from torchgen.model.
It does so by splitting torchgen.utils (the offender) into
torchgen.utils (no yaml) and torchgen.yaml (which uses yaml).
Test Plan:
- CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100203
Approved by: https://github.com/ezyang, https://github.com/Skylion007
This PR introduces a new operator called aten._assert_async.msg, which allows passing a tensor value and assertion message as inputs. As part of TorchDynamo, we're replacing the use of torch._assert with this new operator so that make_fx also knows how to handle assertions. This is subset of https://github.com/pytorch/pytorch/pull/98878, refer there for historic reviews.
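A hedged sketch of the op (the overload name and signature are taken from this description; exact details may differ):
```py
import torch

cond = torch.tensor(True)
torch.ops.aten._assert_async(cond)                        # existing op: condition only
torch.ops.aten._assert_async.msg(cond, "cond must hold")  # new overload carrying a message
```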
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100101
Approved by: https://github.com/jansel
Currently, if we use NO_SHARD strategy for fully_shard and set state_dict_type to be SHARDED_STATE_DICT, a runtime error would be raised ("``sharded_state_dict`` can only be used when parameters are flatten and sharded.").
This PR updates pre_state_dict_hook, post_state_dict_hook, pre_load_state_dict_hook, and post_load_state_dict_hook to set state_dict_type and state_dict_config to full state when using NO_SHARD, even if the state_dict_type and state_dict_config of the root module is set to sharded state.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100208
Approved by: https://github.com/rohan-varma
Currently, we track 'origins' on IR nodes so that we have some idea about what FX IR nodes contributed to any given fused kernel. However, the origins are dumped into an undifferentiated set, so if you have, e.g., multiple outputs, you cannot easily tell which output corresponds to which FX node.
This PR introduce a more precise notion of tracking "origin_node" which says that the contents of this Buffer/Loop node corresponds EXACTLY to the output of a particular FX node; e.g., if you serialized each intermediate when running the generated inductor code, you could compare them with the corresponding intermediates from the original FX graph.
Tracking origin_node in all cases requires quite a bit of effort, so this PR introduces the tracking on a strictly best effort basis. The logic in torch/_inductor/graph.py sets up the associations, but only when it is "obvious" which IR node should get the assignment, and there is work in torch/_inductor/ir.py for propagating this information around as necessary. Like origins, origin_node is not a true dataclass field (as this would break all existing positional arg call sites), instead, it is added post facto via `__post_init__`. At the moment, it is only valid for Buffer/Loop to have an origin_node, but we could imagine relaxing this in the future.
The payoff is in torch/_inductor/codegen/wrapper.py and torch/_inductor/codegen/triton.py where we currently just print the FX node name and the tensor (but a more useful integration will be coming later.)
I also introduce a debugging tool `debug_ir_traceback` which tracks tracebacks of where IRNodes were allocated, to help you understand why a node doesn't have an `origin_node`.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100110
Approved by: https://github.com/voznesenskym
Metadata to store in the GraphModule:
- input shape constraints
- example inputs
- other inline constraints
The saved constraints (in mem) will be used directly after export to convert constraints to runtime assertion which is a separate pass after export.
The requirements for the saved constraints:
1. Be able to locate where the constraints come from.
2. Should not break the exported graph module serialization.
Examples of saved constraints
```
input_shape_constraints:
{'t_id': 140266058179792, 'dim': 0, 'min': 6, 'max': oo}
{'t_id': 140266058179792, 'dim': 0, 'min': 2, 'max': 10}
inline_constraints:
i1: ValueRanges(lower=2, upper=5)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99961
Approved by: https://github.com/tugsbayasgalan
Summary: Importing torch.ao.quantization._pt2e from dynamo led to
internal test failures related to memory profiling. For now,
let's express the path using a simple string instead.
Reviewers: jerryzh168, kimishpatel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100194
Approved by: https://github.com/jerryzh168
This reverts commit ae40a6c7356190ef86b14b10a94a58ca41ca496b.
Reverted https://github.com/pytorch/pytorch/pull/100215 on behalf of https://github.com/huydhn due to Sorry for revert your change, but it breaks lint, please run lintrunner -a torch/testing/_internal/distributed/distributed_test.py to fix the issue then reland it
Previously the change to aten/src/ATen/native/LossNLL.cpp eventually resulted in a double / SymInt division, which ended up calling the int64_t / SymInt overload, truncating the double (bad!) By adding overloads for all the int/float types, we avoid this situation from happening in the future.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100008
Approved by: https://github.com/albanD
This is a mirror PR of D45339293
Summary:
These tests cause the following errors internally with unknown reason:
```
AttributeError: type object 'TestDistBackendWithSpawn' has no attribute 'test_ddp_hook_with_optimizer_parity_adam'
AttributeError: type object 'TestDistBackendWithSpawn' has no attribute 'test_ddp_hook_with_optimizer_parity_adamw'
AttributeError: type object 'TestDistBackendWithSpawn' has no attribute 'test_ddp_hook_with_optimizer_parity_sgd'
```
Commenting these tests out to unblock other PRs.
Test Plan: Sandcastle
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100215
Approved by: https://github.com/wz337, https://github.com/fduwjj
There's a longstanding, well-known mutability bug in dynamo, https://github.com/pytorch/pytorch/issues/93610 (and more issues, but this is the one I had at hand).
Ops that do in-place mutation of tensors will mutate their corresponding FakeTensors.
So, for example, if you do `t_` on a tensor, you will reverse its strides. This, in turn, means that the FakeTensor's strides are now also reversed, say, if you are trying to torch.compile:
```
class F(torch.nn.Module):
def forward(self, x, y):
x = x.t_()
y = y.t_()
return (x + y,)
```
However, we recently introduced accessing the fake_tensor memo/cache to get the symbolic shape values for sizes and strides during guard installation time.
This means that tensors captured with a given size and stride, say, for x above, size (3, 3) and stride (3, 1), will get their memo updated to size (3, 3), stride (1, 3). Now, whenever you access this value for anything, it reflects its current state in the tracing, as opposed to the state at which we initially started tracing.
This causes us to produce guards that are never valid; for the example above, `x.stride()[0] == 3`.
The solution is to not allow mutation to affect the fake tensors we use as source of truth here. We can do this by forcing a clone of the fake tensor at builder time, and storing that as the source of truth for our dynamic sizes and strides during guard installation.
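For concreteness, a small standalone illustration of the stride flip that poisons the memo:
```py
import torch

x = torch.randn(3, 3)
print(x.stride())   # (3, 1)
x.t_()              # in-place transpose: the FakeTensor memo mutates the same way
print(x.stride())   # (1, 3), so a guard recorded as x.stride()[0] == 3 can never hold
```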
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100128
Approved by: https://github.com/ezyang
When use_orig_param is True and sharding is NO_SHARD, parameters and states are not flattened, so optimizer states should not be flattened either. The unit test will fail without the fix.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100189
Approved by: https://github.com/awgu
The input tensor of the RNN forward must be the same type as the weights.
When passing a tensor of type long, the error is:
`RuntimeError: expected scalar type Long but found Float`
This is misleading because it suggests converting something to Long, but the correct solution is to convert the input to Float (which is the type of the weights).
The new error:
`RuntimeError: input must have the type torch.float32, got type torch.int64`
is correct and more informative.
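A minimal sketch that triggers the message (hypothetical sizes):
```py
import torch

rnn = torch.nn.RNN(input_size=4, hidden_size=8)
x_long = torch.zeros(5, 3, 4, dtype=torch.int64)

try:
    rnn(x_long)          # wrong dtype: the error now points at the input dtype
except RuntimeError as e:
    print(e)

rnn(x_long.float())      # converting the input to the weights' dtype works
```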
Fixes #99998
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100100
Approved by: https://github.com/drisspg
Fixes [#82206](https://github.com/pytorch/pytorch/issues/82206)
When executing a `ShardedGradScaler` step in the context of `cpu_offload`, [the function](ecd2c71871/torch/distributed/fsdp/sharded_grad_scaler.py (L151-L152)) `_foreach_non_finite_check_and_unscale_cpu_` is grindingly slow. This issue is due to the elementwise op dispatching/redispatching/execution that is engendered by the current approach to gradient tensor validation:
ecd2c71871/torch/distributed/fsdp/sharded_grad_scaler.py (L159-L163)
The subsequent `isinf` and `isnan` checks with associated `any` checks result in unscalable elementwise op dispatches:
ecd2c71871/torch/distributed/fsdp/sharded_grad_scaler.py (L173-L181)
This inefficiency is of course hidden in the current FSDP tests given their (appropriately) trivial parameter dimensionality. In the perf analysis below, the example test configures only the final `Linear(4, 8)` module parameters to require grad, so there are 40 elements to iterate through. However, if one increases the dimensionality to a still-modest 320008 elements (changing the final module to `Linear(40000, 8)`), the execution time/cpu cost of the test is dominated by the elementwise op dispatching/redispatching/execution of the `any` validation ops in this function.
To characterize the current behavior, I use a slightly modified version of an existing `ShardedGradScaler` test [^1]. The following modifications to the test are made to allow the analysis:
1. Run just `CUDAInitMode.CUDA_BEFORE` for clarity instead of additional scenarios
2. Increase the final module to `Linear(40000, 8)` (along with modifying the preceding module to make the dimensions work),
3. For the cProfile run (but not valgrind or perf) the test runs just a single [`_train_for_several_steps`](ecd2c71871/torch/testing/_internal/common_fsdp.py (L926-L934)) step per rank (instead of 2 steps)
4. I temporarily reduce `init_scale` further to ensure we don't hit any `infs`, short-circuiting our analysis
### Current behavior
The most relevant call subgraph:

Note that:
1. Instead of dispatching to the relevant autograd op and then redispatching to the relevant CPU op implementation 8 times per test (2 train steps x 2 `any` calls per parameter per step x 2 orig parameters), we (I believe unnecessarily) invoke the relevant dispatch flow elementwise, so 640016 times! (only 1 node in this trace, so 320008 elements/2 x 2 train steps x 2 calls per element per step).
2. Nearly 50% of the relative (inclusive) instruction reads for the entire test in `callgrind` are executed by the `isnan` (320008 execs), `isinf` (320008 execs) and `any` (640016 execs) calls.
3. The `any` pre-dispatch entry point IRs (`torch::autograd::THPVariable_any`) vs actual op implementation IRs (`at::native::structured_any_all_out::impl`) are below to give one a sense of the relative dispatch and op execution cost in an elementwise context[^3].


Using cprofile stats:
```bash
python -c "import pstats; stats=pstats.Stats('/tmp/fsdp_cprofile_8wa9uw39.stats'); stats.print_stats()"
...
ncalls tottime percall cumtime percall filename:lineno(function)
1 20.159 20.159 66.805 66.805 torch/distributed/fsdp/sharded_grad_scaler.py:151(_foreach_non_finite_check_and_unscale_cpu_)
160004 18.427 0.000 18.427 0.000 {built-in method torch.isinf}
160004 6.026 0.000 6.026 0.000 {built-in method torch.isnan}
```
We see that a single step of the scaler runs for more than a minute. Though there is non-trivial cProfile overhead, we can infer from this that per-element op dispatches/executions are on the order of 100 ns.
On the order of 100 nanoseconds per dispatch is acceptable if we're using typical tensor access patterns, but if we're dispatching each element for each op, obviously everything is going to come to a grinding halt for many practical use cases.
(Given the cost of this function is currently O(n) in the number of gradient elements, feel free to set `TORCH_SHOW_DISPATCH_TRACE=1` if you want to make this function cry 🤣)
I've attached a flamegraph at the bottom of the PR[^2] that more intuitively demonstrates the manner and extent of resource consumption attributable to this function with just a modest number of gradient elements.
### After the loop refactor in this PR:
The most relevant call subgraph:

Note that:
1. Less than 0.4% of the relative (inclusive) instruction reads for the entire test in `callgrind` are executed by the `isnan` (4 execs), `isinf` (4 execs) and `any` (8 execs) calls (versus ~50% and 320008, 320008, 640016 respectively above)
2. The `any` pre-dispatch entry point IRs (`torch::autograd::THPVariable_any`) vs actual op implementation IRs (`at::native::structured_any_all_out::impl`) reflect far less overhead (of secondary importance to item number 1)


Using cprofile stats:
```bash
python -c "import pstats; stats=pstats.Stats('/tmp/fsdp_cprofile_pfap7nwk.stats'); stats.print_stats()"
...
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.013 0.013 0.109 0.109 torch/distributed/fsdp/sharded_grad_scaler.py:151(_foreach_non_finite_check_and_unscale_cpu_)
2 0.022 0.011 0.022 0.011 {built-in method torch.isinf}
2 0.018 0.009 0.018 0.009 {built-in method torch.isnan}
```
We can see our function runtime has dropped from more than a minute to ~100ms.
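For illustration, a minimal sketch (not the actual ShardedGradScaler code; names are made up) of the per-element versus whole-tensor check pattern discussed above:
```py
import torch

grads = [torch.randn(8)]  # stand-in for per-rank gradient shards

def has_nonfinite_per_element(grads):
    # One isinf/isnan dispatch per element: cost grows with gradient size.
    return any(bool(torch.isinf(v)) or bool(torch.isnan(v))
               for g in grads for v in g.flatten())

def has_nonfinite_per_tensor(grads):
    # A constant number of dispatches per gradient tensor.
    return any(bool(torch.isinf(g).any() or torch.isnan(g).any()) for g in grads)

assert has_nonfinite_per_element(grads) == has_nonfinite_per_tensor(grads)
```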
### Assumptions associated with this loop refactor:
The key assumptions here are:
1. The grads are always on CPU in this function so any MTA-safe constraints ([`can_use_fast_route`](efc3887ea5/aten/src/ATen/native/cuda/AmpKernels.cu (L110-L111)) relating to the relevant CUDA kernel path selection, i.e. slower `TensorIterator` gpu kernel vs `multi_tensor_apply_kernel`) do not apply in this context
2. We've already filtered by dtype and device and can assume the presence of a single CPU device. Unless manually creating separate CPU devices with manually set non-default indexes (which I don't think FSDP supports and should be validated prior to this function), device equality should always be `True` for `cpu` type devices so we should just need to check that the current device is of `cpu` type. [^4].

[^1]: `TestShardedGradScalerParityWithDDP.test_fsdp_ddp_parity_with_grad_scaler_offload_true_none_mixed_precision_use_orig_params` test in `test/distributed/fsdp/test_fsdp_sharded_grad_scaler.py`
[^2]: Note the native frame stacks for `torch::autograd::THPVariable_isinf`, `torch::autograd::THPVariable_isnan`, `torch::autograd::THPVariable_any` in particular.
[^3]: There's more `TensorIterator` etc. setup overhead further up the stack beyond `structured_any_all_out`, but roughly speaking
[^4]: Device equality is based on [type and index combination](efc3887ea5/c10/core/Device.h (L47-L51)), CPU device type is -1 by default (`None` on the python side) and is intended to [always be 0](cf21240f67/c10/core/Device.h (L29)) if set explicitly. Though technically, unless in debug mode, this constraint isn't [actually validated](bb4e9e9124/c10/core/Device.h (L171-L184)), so one can actually manually create separate `cpu` devices with invalid indices. I suspect it's safe to ignore that potential incorrect/unusual configuration in this context but let me know if you'd like to add another `cpu` device equality check.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100108
Approved by: https://github.com/awgu
This PR makes the summary of dimension constraints actionable. Before the PR, it would print:
```
torch.fx.experimental.symbolic_shapes: [WARNING] Summary of dimension constraints:
The following dimensions have been specialized and CANNOT be dynamic.
NOTE: Specializations will happen by default with `assume_static_by_default=True`.
L['c'].size()[1] == 3
L['a'].size()[2] == 3
L['a'].size()[1] == 3
L['b'].size()[2] == 2
L['b'].size()[1] == 2
L['c'].size()[2] == 3
The following dimensions CAN be dynamic.
You can use the following code to specify the constraints they must satisfy:
'''
constraints=[
dynamic_dim(L['c'], 0) == dynamic_dim(L['a'], 0),
2 <= dynamic_dim(L['b'], 0),
2 <= dynamic_dim(L['a'], 0),
]
'''
```
Users need to initialize the L environment manually and copy the constraints over. After the PR, we have:
```
[2023-04-26 05:43:12,849] torch._dynamo.eval_frame: [WARNING] Summary of dimension constraints:
The following dimensions have been specialized and CANNOT be dynamic.
NOTE: Specializations will happen by default with `assume_static_by_default=True`.
'''
def specializations(a, b, c):
return (a.size()[2] == 3 and
c.size()[1] == 3 and
a.size()[1] == 3 and
c.size()[2] == 3 and
b.size()[2] == 2 and
b.size()[1] == 2)
'''
The following dimensions CAN be dynamic.
You can use the following code to specify the constraints they must satisfy:
'''
def specify_constraints(a, b, c):
return [
2 <= dynamic_dim(b, 0),
dynamic_dim(c, 0) == dynamic_dim(a, 0),
2 <= dynamic_dim(a, 0),
]
'''
```
where `specify_constraints` has the same input signature as the user's code. This allows users to copy-paste and run the code to generate the constraints before exporting, as shown below:
```
def specify_constraints(a, b, c):
return [
2 <= dynamic_dim(b, 0),
dynamic_dim(c, 0) == dynamic_dim(a, 0),
2 <= dynamic_dim(a, 0),
]
torch._dynamo.export(my_dyn_fn, x, y, z, constraints=specify_constraints(x, y, z))
```
Implementation-wise, this PR also:
1. changes shape_env.produce_guards to produce_guards_and_constraints,
2. adds contraints_export_fn hooks,
The purpose is to surface the DimConstraints to dynamo.export, where we could reliably get the original function's signature.
The alternative to the above is to get the function signature before creating SHAPE_ENV guard (https://github.com/pytorch/pytorch/blob/main/torch/_dynamo/output_graph.py#L227) and pass it to DimConstraints, but I couldn't recover the signature before creating SHAPE_ENV because the frame's f_globals/locals don't contain the original function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100103
Approved by: https://github.com/guangy10, https://github.com/tugsbayasgalan
Talked to @zou3519 and @ezyang on what the right UX is: tentatively, adding a new dynamo backend is cheap and simple, so it seems worth doing. And longer term, we agreed (?) that it's worth seeing if we can get custom ops sanity asserts to run more automatically, instead of needing a separate backend.
Side comment: that actually seems tough: the mode detects secret mutations by cloning every input to every op, running the op, and checking that the data matches between the real input and the cloned input. So I doubt we'll be able to make that behavior always-on? It would need some config at least.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99744
Approved by: https://github.com/albanD, https://github.com/ezyang, https://github.com/zou3519
Split the existing 4-hour schedule into two 8-hour ones
And schedule the x86 macOS tests every 8 hours and exclude them from leak
checks
Schedule iOS tests every 8 hours and exclude them from leak-checks as
well
Remove the iOS Metal job, as it is already covered by the ARM64 MPS job as well
as the x86 and arm64 vanilla jobs, and it never caught any regressions in the
last 60 days, based on data from running the following query on RockSet:
```sql
SELECT started_at,
DATE_DIFF(
'MINUTE',
PARSE_TIMESTAMP_ISO8601(started_at),
PARSE_TIMESTAMP_ISO8601(completed_at)
) as duration,
conclusion, name, html_url, torchci_classification
FROM commons.workflow_job
WHERE
workflow_name = 'periodic' and
name like 'ios-12% % build (default, 1, 1, macos-12)' and
url like 'https://api.github.com/repos/pytorch/pytorch/%'
and conclusion = 'failure'
order by started_at desc, run_id;
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100182
Approved by: https://github.com/PaliC, https://github.com/huydhn
On top of #95849 this PR is trying to handle the special case when dealing with numpy.
Consider the following example:
```
def f(x: torch.Tensor) -> np.ndarray:
a = x.numpy()
return a.T
```
In the previous PR this would error out because we translate `a.T` to a method call on `torch_np.ndarray`, whose result is also a `torch_np.ndarray`.
This PR handles this case, by conditionally converting a `torch_np.ndarray` to `np.ndarray` before returning, to match the original behavior.
The compiled version will be:
```
def f(x):
___tmp_0 = __compiled_fn_0(x)
if isinstance(___tmp_0, torch_np.ndarray):
return ___tmp_0.tensor.numpy()
else:
return ___tmp_0
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99560
Approved by: https://github.com/jansel, https://github.com/yanboliang
Issue: #93684
# Problem
Reduce graph breaks when dynamo compiles python functions containing numpy functions and ndarray operations.
# Design (as I know it)
* Use torch_np.ndarray (a wrapper of tensor) to back a `VariableTracker`: `NumpyTensorVariable`.
* Translate all attribute and method calls on ndarray to their torch_np.ndarray equivalents.
This PR adds `NumpyTensorVariable` and supports:
1. tensor to ndarray, ndarray to tensor
2. numpy functions such as numpy.meshgrid()
3. ndarray attributes such as `itemsize`, `stride`
Next PR will handle returning `np.ndarray` and add support for ndarray methods
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95849
Approved by: https://github.com/ezyang
@wconstab As we discussed last Friday, I added the unit test for explicitly calling __call__ and added a comment to explain why we redirect ```UserMethodVariable.call_function``` to ```NNModuleVariable.call_method``` for a certain case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100146
Approved by: https://github.com/wconstab
This adds helpers that replace triton's `minimum`, `maximum`, `min` and
`max` with correct NaN propagation. I also removed
`ops.int_minimum` in favor of `ops.minimum`, because we can just omit
the NaN checks by checking the dtype.
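For intuition, here is a minimal sketch of what a NaN-propagating `maximum` helper can look like (not necessarily the exact helper that landed; it is meant to be called from inside another `@triton.jit` kernel):
```python
import triton
import triton.language as tl

@triton.jit
def maximum(a, b):
    # (a != a) is True only when a is NaN, so a NaN in either input wins:
    # if a is NaN we return a; if only b is NaN, a > b is False and we return b.
    mask = (a > b) | (a != a)
    return tl.where(mask, a, b)
```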
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99881
Approved by: https://github.com/ngimel
This changes codegen of `torch.prod` from:
```python
tl.reduce(tmp2, 1, _prod_accumulate)[:, None]
```
where `_prod_accumulate` is defined elsewhere, to
```python
triton_helpers.prod(tmp2, 1)[:, None]
```
A quirk I uncovered though is that `TritonCodeCache` breaks if you
define any new symbol beginning with `triton_`, since it assumes that
must be the kernel name. Instead, I've made the kernel name an
explicit argument to `async_compile.triton` so it doesn't have to guess.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99880
Approved by: https://github.com/ngimel
Add use_local_synchronization argument to new_group.
When this argument is True, it changes new_group to do a store_barrier only on the ranks that are part of the group rather than the whole cluster (a usage sketch follows the benchmark tables below).
This addresses both the scalability and composability problems associated with new_group.
Fixes #81291.
This is relanding #84224
As part of the original PR I did a quick benchmark of creating 3 PGs per rank using both functions and perf is the following:
new_group use_local_synchronization=False:
| World Size | Time (in secs) |
| --- | ----------- |
| 4 | 0.12 |
| 8 | 0.25 |
| 16 | 0.51 |
| 32 | 0.87 |
| 64 | 1.50 |
| 128 | 2.87 |
new_group use_local_synchronization=True:
| World Size | Time (in secs) |
| --- | ----------- |
| 4 | 0.05 |
| 8 | 0.04 |
| 16 | 0.03 |
| 32 | 0.03 |
| 64 | 0.04 |
| 128 | 0.04 |
Scaling for `use_local_synchronization=False` is sublinear because the number of process groups created as a multiple of world_size decreases as we go up. It's 6 with world_size 4 and 192 with world_size 128.
Scaling for `use_local_synchronization=True` is constant as the number of store barriers executed per rank remains constant at 3.
Setup:
1 AWS host, backend gloo.
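For reference, a minimal usage sketch of the new argument (assuming the default process group is already initialized and this code runs only on the member ranks):
```python
import torch.distributed as dist

# only ranks 0 and 1 perform the store barrier here; the rest of the cluster is not involved
pg = dist.new_group(ranks=[0, 1], use_local_synchronization=True)
```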
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99931
Approved by: https://github.com/xw285cornell
The new minifier script looks like this:
```
import torch._dynamo.repro.after_aot
reader = torch._dynamo.repro.after_aot.InputReader(save_dir='/tmp/tmpcsngx39e')
buf0 = reader.storage('e2b39c716c0d4efb9fa57375a3902b9dab666893', 16)
t0 = reader.tensor(buf0, (4,))
args = [t0]
mod = make_fx(Repro(), tracing_mode='real')(*args)
```
The real tensor data is stored in the storages folder of the checkpoint dump directory. If you delete this folder / it is otherwise missing, we will transparently fall back to generating random data like before. The tensors are serialized using content store from #99809, which means each storage is content-addressed and we will automatically deduplicate equivalent data (which is useful if you keep dumping out, e.g., your parameters.) We don't use the tensor serialization capability from content store, instead all of the tensor metadata is stored inline inside the repro script (so that everything is in one file if you lose the checkpointed tensors).
We also add a stable_hash option to content store, where we use a slow SHA-1 sum on the data in CPU side to compute a hash that is stable across systems with the same endianness.
Out of rage, I also added support for Dtype.itemsize property access.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99834
Approved by: https://github.com/voznesenskym
Make traceable collectives work with torchdynamo,
bypassing problems with tracing the AsyncTensor subclass.
Accept a suboptimal solution for now, and optimize it later.
For now, wait happens immediately, which generally forces an early sync.
Later, find a way either in dynamo or AOT stack to handle
AsyncCollectiveTensor to get the wait in the optimal place.
Note on implementation:
- Dynamo traces 'user-level' fc apis that are designed to behave differently
in eager vs compiled. In eager, there will be work-obj registration and
a wrapper subclass will insert a 'wait' call at the appropriate time.
In compile/trace mode, wait will be immediately called, and work obj
registration is required to be handled by the compile backend at runtime.
- Dynamo needs to trace into some of the helper functions in the 'user-level'
api, such as '_expand_group' which is essentially a constant transformation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94440
Approved by: https://github.com/kumpera
Consider a custom backend implemented on top of privateuse1 with semantics identical to CUDA (since CUDA is so popular), named for example 'my_device', and registered under the module name torch.my_device.
This PR aims to satisfy the constraints of such a backend so that it can be directly integrated into the current FSDP implementation.
The main issues addressed are:
#### 1. Device decision for FSDP wrapping of Modules without Parameters
Users typically organize FSDP code as follows:
```python
m = Module().to('my_device:0')
fsdp_m = FSDP(m)
```
or like this:
```python
m = Module()
fsdp_m = FSDP(m, device_id=torch.device('my_device', 0))
```
If the model has Parameters, everything works fine because FSDP will prioritize the device where the Parameters are located. However, for Modules without Parameters, the to() call has no side effects, and FSDP will assume the current CUDA device, which prevents the use of devices other than the current CUDA device for Modules without Parameters. Therefore, when FSDP is called with a device_id argument, this configuration takes top priority.
#### 2. Abstraction of a cuda-like device
Now, in addition to compute_device, _FSDPState includes a device_handler member. In fact, this device_handler is now just a reference to either torch.cuda or torch.my_device. From now on, code that works based on _FSDPState should use state.device_handler to create, wait on, or synchronize streams, just as it previously used torch.cuda directly.
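A minimal sketch of the idea (the helper name here is hypothetical, not the actual FSDP internals):
```python
import torch

def _get_device_handler(device: torch.device):
    # torch.cuda for CUDA devices, otherwise the backend module registered
    # under the same name (e.g. torch.my_device for a privateuse1 backend)
    if device.type == "cuda":
        return torch.cuda
    return getattr(torch, device.type)
```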
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99024
Approved by: https://github.com/awgu
I want to use torchgen to generate code, and my yaml file format is the same as `native_functions.yaml`.
I will use the PrivateUse1 dispatch key, but in my yaml file I don't want to expose the name PrivateUse1 to the user.
So I want to achieve the following result (e.g. my device is `YPU`):
```
>>>from torchgen.model import DispatchKey
>>>str(DispatchKey.PrivateUse1)
"YPU"
>>>DispatchKey.parse("YPU")
DispatchKey.PrivateUse1
```
I also thought that not everyone would need this feature, so I added a new function to handle this scenario.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99406
Approved by: https://github.com/ezyang
enable -Werror=sign-compare in our Bazel build
Summary:
This is already turned on for CMake, let's see what breaks.
Test Plan: Rely on CI.
Reviewers: sahanp
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98671
Approved by: https://github.com/kit1980
Disable tests using quantized operators if QNNPACK is not available
Two disabled tests use Int8FC operators
which are not available if QNNPACK is not available,
and fail only due to that.
Disable cpuid_test on s390x
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99871
Approved by: https://github.com/albanD
On big endian systems byteswapping should be done the other way around.
This change fixes TestE2ETensorPipe.TestTrainingLoop test from
test_cpp_rpc testsuite on big endian systems.
Use uint64_t when decoding double values.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99869
Approved by: https://github.com/ezyang
A single call to the `GraphModule.recompile` function occurs after the `GraphModule` has been constructed.
62f9189d9d/torch/_dynamo/output_graph.py (L754-L755)
However, the recompile function has already been called once during construction, so this call should be redundant.
```
call stack:
recompile, graph_module.py:644
graph, graph_module.py:411
__setattr__, module.py:1674
__init__, graph_module.py:370
compile_and_call_fx_graph, output_graph.py:754
...
```
So maybe it can be deleted.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100084
Approved by: https://github.com/ezyang
* Added ExportPassBase, an interpreter based helper pass writing class
* It can also help maintain the dialect based on the operator namespace through having users override the `get_valid_dialects` function (returning an empty lists implies the pass works for any dialect).
* Added a `ReplaceBrokenOpsWithFunctionalOpsPass` to replace all ops that have not been converted with functionalization with their functional ones.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100000
Approved by: https://github.com/gmagogsfm
Currently, we return `unimplemented` without a graph break on seeing an x.unsqueeze_ when x is an input. This essentially means we fall back to running the original frame.
This PR actually graph breaks so that we can generate the continuation frame for the rest of the function. Instead of graph breaking at LOAD_ATTR, we delay the graph break to the actual CALL_FUNCTION, where it's cleaner to graph break.
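A minimal repro sketch of the pattern in question (illustrative only):
```python
import torch

def fn(x):
    x.unsqueeze_(0)  # in-place metadata mutation of a graph input
    return x + 1

opt_fn = torch.compile(fn, backend="eager")
opt_fn(torch.randn(3))  # previously fell back to the original frame; now graph-breaks at the call
```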
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99986
Approved by: https://github.com/jansel
simplify method_def generation
Summary:
This removes some duplication. This was originally done to streamline
a subsequent change, but that change turned out to be
misguided. Nevertheless, this is a nice simplification.
Test Plan:
This should change the code gen by removing some redundant
parentheses. Rely on CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100059
Approved by: https://github.com/ezyang
remove casts to `getter` in python_cpp_function.h
Summary:
These were triggering the warning `-Wcast-function-type-strict` and
breaking the build on my machine.
Test Plan: Rely on CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100065
Approved by: https://github.com/ezyang
Summary: This commit adds a private helper function to override
the default QConfig in the default QConfigMapping. Previously we
needed to override all the object_types manually while skipping
the fixed qparams ops. This led to duplicate code every time
someone wanted a new default QConfig. After this commit, we can
just call the same helper function instead.
Test Plan:
python test/test_quantization.py TestQuantizeFx
Reviewers: jerryzh168, vkuzo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99888
Approved by: https://github.com/vkuzo, https://github.com/jerryzh168
Implements a simple content-addressable store for storages (with tensors implemented as cheap references on top), enabling incremental serialization of tensors to disk, which I intend to use in the accuracy repro extractor. Check the comment at the top of torch/utils/_content_store.py for more details on the intended use case.
One major piece of this PR is implementing the content hash for tensors. For our prospective use case, we may need to repeatedly hash up to 80 GB of tensor data every time we snapshot (and we may snapshot multiple times). Using a conventional cryptographic hash and hashing each snapshot would likely take on the order of minutes, which seemed too slow to me. So instead, I implemented a crappy hash function that can be run on GPU. It is at least somewhat theoretically grounded: using random parameters generated by Philox, we use the standard shift-multiply and xor-sum universal hash family. The hash function is a bit dorky though; instead of properly doing 160-bit math, it just runs the 32-bit hash five times and cats them together. By the way, this sets the first precedent for a kernel in the PyTorch library which MUST be torch.compile'd to be run (in fact, this kernel does not run in eager mode because of the use of xor_sum, which doesn't actually exist in ATen.)
I had to add a few more primitives to inductor, namely randint (over the entire int range) and xor_sum. Fortunately, these primitives are natively supported by Triton/C++, and so they were very easy to plumb through. xor_sum is exposed as a prim, while randint special cases on when low/high span the entire 32-bit signed integer range.
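For intuition, here is a toy CPU-side sketch of the shift-multiply-and-xor-sum idea (not the actual GPU kernel, which uses Philox-generated parameters and is torch.compile'd):
```python
import torch

def toy_content_hash(data_u32: torch.Tensor, seed: int = 0) -> int:
    # data_u32: tensor of (already bit-cast) 32-bit integer words
    g = torch.Generator().manual_seed(seed)
    a = torch.randint(0, 2**31 - 1, data_u32.shape, generator=g, dtype=torch.int64) | 1  # odd multipliers
    mixed = (data_u32.to(torch.int64) * a) & 0xFFFFFFFF  # shift-multiply universal hash, mod 2**32
    out = 0
    for word in mixed.tolist():
        out ^= int(word)  # xor-sum reduction
    return out
```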
Thanks to Jeff Johnson for letting me bounce ideas off him on a Saturday morning lol.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99809
Approved by: https://github.com/voznesenskym
This PR proposes an optimized way to do Exponential Moving Average (EMA), which is faster than the current way using `swa_utils.AveragedModel` described in https://pytorch.org/docs/stable/optim.html#custom-averaging-strategies.
This implementation is asynchronous, and is built as an optimizer wrapper so that the EMA weight update happens without any additional CPU/GPU sync, just after optimizer steps, and with limited code changes.
Example usage:
```
model = Model().to(device)
opt = torch.optim.Adam(model.parameters())
opt = EMAOptimizer(opt, device, 0.9999)
for epoch in range(epochs):
training_loop(model, opt)
regular_eval_accuracy = evaluate(model)
with opt.swap_ema_weights():
ema_eval_accuracy = evaluate(model)
```
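The core update the wrapper applies after each optimizer step is the usual EMA rule, sketched below (simplified; the actual implementation streams this without extra CPU/GPU syncs):
```python
import torch

@torch.no_grad()
def ema_update(ema_params, params, decay=0.9999):
    # ema <- decay * ema + (1 - decay) * param, applied in place to the EMA copies
    for ema_p, p in zip(ema_params, params):
        ema_p.mul_(decay).add_(p, alpha=1.0 - decay)
```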
Here are some benchmarks (time per iteration) on various torchvision models:
|model|this PR iteration time |swa_utils.AveragedModel iteration time| iteration speedup |
|-----|-----------------------------|-----------------------|---------------------------------------------|
| | | | |
|regnet_x_1_6gf|62.73 |67.998 |1.08 |
|regnet_x_3_2gf|101.75 |109.422 |1.08 |
|regnet_x_400mf|25.13 |32.005 |1.27 |
|regnet_x_800mf|33.01 |37.466 |1.13 |
|regnet_x_8gf|128.13 |134.868 |1.05 |
|regnet_y_16gf|252.91 |261.292 |1.03 |
|regnet_y_1_6gf|72.14 |84.22 |1.17 |
|regnet_y_3_2gf|99.99 |109.296 |1.09 |
|regnet_y_400mf|29.53 |36.506 |1.24 |
|regnet_y_800mf|37.82 |43.634 |1.15 |
|regnet_y_8gf|196.63 |203.317 |1.03 |
|resnet101|128.80 |137.434 |1.07 |
|resnet152|182.85 |196.498 |1.07 |
|resnet18|29.06 |29.975 |1.03 |
|resnet34|50.73 |53.443 |1.05 |
|resnet50|76.88 |80.602 |1.05 |
|resnext101_32x8d|277.29 |280.759 |1.01 |
|resnext101_64x4d|269.56 |281.052 |1.04 |
|resnext50_32x4d|100.73 |101.102 |1.00 |
|shufflenet_v2_x0_5|10.56 |15.419 |1.46 |
|shufflenet_v2_x1_0|13.11 |18.525 |1.41 |
|shufflenet_v2_x1_5|18.05 |23.132 |1.28 |
|shufflenet_v2_x2_0|25.04 |30.008 |1.20 |
|squeezenet1_1|14.26 |14.325 |1.00 |
|swin_b|264.52 |274.613 |1.04 |
|swin_s|180.66 |188.914 |1.05 |
|swin_t|108.62 |112.632 |1.04 |
|swin_v2_s|220.29 |231.153 |1.05 |
|swin_v2_t|127.27 |133.586 |1.05 |
|vgg11|95.52 |103.714 |1.09 |
|vgg11_bn|106.49 |120.711 |1.13 |
|vgg13|132.94 |147.063 |1.11 |
|vgg13_bn|149.73 |165.256 |1.10 |
|vgg16|158.19 |172.865 |1.09 |
|vgg16_bn|177.04 |192.888 |1.09 |
|vgg19|184.76 |194.194 |1.05 |
|vgg19_bn|203.30 |213.334 |1.05 |
|vit_b_16|217.31 |219.748 |1.01 |
|vit_b_32|69.47 |75.692 |1.09 |
|vit_l_32|223.20 |258.487 |1.16 |
|wide_resnet101_2|267.38 |279.836 |1.05 |
|wide_resnet50_2|145.06 |154.918 |1.07 |
You can see that in all cases it is faster than using `AveragedModel`. In fact in many cases, adding EMA does not add any overhead since the computation is hidden behind the usual iteration flow.
This is a similar implementation to the one currently in [NVIDIA NeMo](https://github.com/NVIDIA/NeMo).
If the team is interested in merging this, let me know and I'll add some documentation similar to `swa_utils` and tests.
Credits to @szmigacz for the implementation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94820
Approved by: https://github.com/janeyx99
Some modules, like lazy modules, may override '_save_to_state_dict()'; in that case, the pre_state_dict hook will not be called. So move the pre_state_dict hook out of '_save_to_state_dict()' to make sure the pre hook is always called.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98964
Approved by: https://github.com/albanD
Fixes https://github.com/pytorch/pytorch/issues/98143.
If a user mutates a tensor that has overlapping memory, this can cause silent correctness issues with torch.compile. This PR adds a few checks to detect that situation and error.
Unfortunately `at::has_internal_overlap()` wasn't smart enough to detect the one linked in the issue, so I added a (simple) check that only runs in functionalization, that can catch the overlapping memory. We might need to revisit and add more complex checks later though (luckily, functionalization runs during compilation time so we can afford more expensive checks).
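A rough sketch of the class of inputs involved (illustrative; the exact repro from the linked issue differs):
```python
import torch

def f(x):
    x.mul_(2)  # in-place mutation of a graph input
    return x

base = torch.randn(10)
x = base.as_strided((4, 4), (1, 1))  # elements overlap in memory, but not via a simple expand
torch.compile(f)(x)  # the new functionalization-time check flags the overlapping mutation
```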
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99919
Approved by: https://github.com/ezyang, https://github.com/albanD
This is a suggestion for a minor modification.
The line `log_normalization[self.total_count + value == 0.] = 0.` prevents JIT compilation when the condition occurs, with the error message
`RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation.`
I propose an alternative that does not involve in-place operations. It uses the function `nan_to_num()` to replace infinite values with 0 where `self.total_count + value == 0.`, while leaving `nan` and `-inf` as they are. Readability is suboptimal because the code does not actually replace nan with numbers, but I could not find a function that only replaces infinite values.
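A minimal sketch of the replacement on illustrative values (only `+inf` is zeroed; `nan` and `-inf` pass through unchanged):
```python
import torch

log_normalization = torch.tensor([float('inf'), 1.5, float('nan')])

# before: log_normalization[mask] = 0.   (in-place indexing, breaks JIT on leaf views)
# after: no in-place op; +inf -> 0, nan and -inf are mapped back to themselves
log_normalization = log_normalization.nan_to_num(nan=float('nan'), posinf=0., neginf=float('-inf'))
```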
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96748
Approved by: https://github.com/fritzo, https://github.com/soulitzer
suppress `-Wcast-function-type-strict` when casting to PyCFunction
Summary:
These casts are a necessary evil due to the design of Python. Python
ultimately casts it back to the original type based on the flags
specified in the `PyMethodDef`.
Nevertheless, the new Clang flag `-Wcast-function-type-strict` breaks
with this.
While here, convert the cast to a `reinterpret_cast`.
Test Plan: Should be a no-op. Rely on CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100068
Approved by: https://github.com/Skylion007
This PR adds the frame summary to the log message, e.g.:
```
[2023-04-26 00:11:21,035] torch._dynamo.symbolic_convert: [INFO] Skipping frame because there is a graph break in a for/while loop
<FrameSummary file /fsx/users/andgu/work/transformers/src/transformers/models/t5/modeling_t5.py, line 1086 in <resume in forward>>
```
Note that the line cited by the frame summary may not be the for/while loop itself but rather a line inside it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100045
Approved by: https://github.com/anijain2305
They block tests test_embedding_bag_2bit_unpack,
test_embedding_bag_4bit_unpack and test_embedding_bag_byte_unpack in test/quantization/core/test_quantized_op.py.
Without these asserts the tests start passing on big endian systems.
Fixes #97803
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99713
Approved by: https://github.com/kit1980
The test fails with device mismatch error:
```
Traceback (most recent call last):
File "/pytorch/torch/testing/_internal/common_utils.py", line 2137, in wrapper
method(*args, **kwargs)
File "/pytorch/torch/testing/_internal/common_device_type.py", line 401, in instantiated_test
result = test(self, **param_kwargs)
File "/pytorch/torch/testing/_internal/common_device_type.py", line 846, in test_wrapper
return test(*args, **kwargs)
File "/pytorch/torch/testing/_internal/common_device_type.py", line 1005, in only_fn
return fn(slf, *args, **kwargs)
File "/pytorch/torch/testing/_internal/common_device_type.py", line 1029, in multi_fn
return fn(slf, devices, *args, **kwargs)
File "/pytorch/test/test_ops.py", line 148, in test_multiple_devices
self.assertTrue(result.device == cuda_device)
AssertionError: False is not true
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99775
Approved by: https://github.com/ngimel
Fixes #72428 according to the decision reached in the comments.
I've left other instances of `w.r.t.` intact (e.g. in parameter/return descriptions, in comments, etc.) because there were many, and I didn't want to go out of scope. That being said, I'm happy to change those as well if we'd prefer the consistency!
I've also fixed a typo that I came across while grepping for instances.
Will update with screenshots once docs are built.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100028
Approved by: https://github.com/albanD
Summary: Today, on a segfault on a single trainer, we end up keeping the GPUs on all ranks blocked for 5 minutes due to the elastic agent's barrier timeouts.
Test Plan: Rely on existing tests to validate. Looking to get some feedback on adding UTs.
Differential Revision: D44929488
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99051
Approved by: https://github.com/kurman, https://github.com/kiukchung
pre_autograd tracing is still early, but it should work for basic cases. This PR changes the API a bit for export to expose pre_autograd tracing. Name bikeshedding is welcome, but it looks like:
```
torch._dynamo.export(..., aten_graph="aten_pre_autograd")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98031
Approved by: https://github.com/ezyang
The bug was that if you want to move a mode to the autograd key, we need to use the "functionality" key for it (AutogradFunctionality). But when we do that, we need to clear any PythonDispatcher caches for every op for **every** autograd key (since you could run autograd ops with both cpu and cuda tensors underneath the mode, both of which may have been cached).
I didn't add a test, since this ends up getting indirectly tested by export in this PR. If someone would prefer a direct test I can add one.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98030
Approved by: https://github.com/ezyang
**TL;DR**: This PR fixes handling for lazy modules where `cls_to_become is None`. In those cases, we should leave the type of the lazy module as the old value.
**Details**:
Lazy modules are intended to be initialized at execution; some of them are also supposed to switch to a different type after they have been initialized. However, not all are supposed to switch; see this logic from `nn/modules/lazy.py`
```python
def _infer_parameters(self, ...):
...
if module.cls_to_become is not None:
module.__class__ = module.cls_to_become
```
i.e., we should leave the module type as the old value if `module.cls_to_become is None`. This PR updates dynamo's handling to match this behavior.
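For example, the built-in LazyLinear does set `cls_to_become`, so its type changes after the first forward; a lazy module with `cls_to_become = None` must instead keep its original type:
```python
import torch

m = torch.nn.LazyLinear(4)   # cls_to_become is nn.Linear
m(torch.randn(2, 3))         # first forward infers in_features and initializes parameters
print(type(m))               # <class 'torch.nn.modules.linear.Linear'>
```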
Test `test_lazy_module_no_cls_to_become` added to `test/dynamo/test_module.py`.
Differential Revision: [D45253698](https://our.internmc.facebook.com/intern/diff/D45253698)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99943
Approved by: https://github.com/jansel
* add a stepcurrent flag (--sc), based off the stepwise flag, that saves the currently running test so that test running can resume from the last successful test after segfaults; it takes an argument for a key so that different test runs don't overwrite each other
* send SIGINT to the process on timeout so that the XML report can still be produced
* add a currently unused stepcurrent-skip flag (--scs), based off the stepwise skip flag, that skips the failing test; was going to use it for the keep-going label but having trouble with CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98035
Approved by: https://github.com/huydhn
Fixes the issue with the PR base detection for bc-lint. See also #98538
The context: to lint a PR for BC-breaking changes we need two commits with the history between them that accurately represents the changes introduced in the PR (and **only** these changes).
---
Previous attempts to achieve this failed due to the following reasons:
1. Use `github.event.pull_request.base.sha` and `github.event.pull_request.head.sha`.
If the PR's base branch advances, the new commits will be included in the `github.event.pull_request.base.sha`, mixing with the changes introduced by the PR.
2. Find a common ancestor between `github.event.pull_request.base.sha` and `github.event.pull_request.head.sha`, use it as a base commit.
This approach fails as well if the PR includes merge commits from the newest history of its base branch. Such commits will appear as changes introduced in the PR and thus interfere with BC-linting.
---
Current approach (this PR):
Perform a merge of the `github.event.pull_request.head.sha` onto the `github.event.pull_request.base.sha`, and use the new commit SHA as the new head.
This approach should always accurately find the changes introduced in the linted PR. The only shortcoming is when the PR cannot be merged onto the HEAD of its base branch. In this case BC-linting is skipped (the linting will be performed when the PR is rebased and conflicts are resolved, which is required anyway before the PR is accepted).
---
### Testing
* [in a separate repo for experiments](https://github.com/izaitsevfb/pr-head-test/pull/2/checks)
* [BC-linter failure (this PR)](https://github.com/pytorch/pytorch/actions/runs/4793434350/jobs/8525891436?pr=99958)
* gh-stack test: [failure](https://github.com/pytorch/pytorch/pull/100004), [success ](https://github.com/pytorch/pytorch/pull/100003)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99958
Approved by: https://github.com/osalpekar
This PR enables fully_shard fused Adam tests with some additional tweaks
about how to handle scalar tensors. Now we treat a scalar tensor as if it
were just a scalar value: we don't distribute it, as there's no need to
shard a scalar tensor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99898
Approved by: https://github.com/mrshenli
This PR introduces **-Wmissing-prototypes** of clang-tidy to prevent further coding errors such as the one fixed by PR #96714.
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at fd2cf2a</samp>
This pull request makes several internal functions static to improve performance and avoid name clashes. It also fixes some typos, formatting, and missing includes in various files. It adds a new .clang-tidy check to warn about missing prototypes for non-static functions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96805
Approved by: https://github.com/malfet, https://github.com/albanD
ROCm path detection currently relies on `hipconfig`. On some systems, when calling `hipconfig` through `subprocess`, Python raises a `NotADirectoryError` that isn't caught at the moment. This commit adds `NotADirectoryError` to the exceptions caught when calling `hipconfig`.
Fixes #98629
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99980
Approved by: https://github.com/jeffdaily, https://github.com/malfet
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99220
Previously we had two places where we needed to decide whether to insert an observer or fake quantizer:
(1) the input arguments of a node and (2) the output of a node, and right now we have separate code for each.
In this PR, the logic is unified in the `_needs_obs_or_fq` helper function, which takes the target_dtype and is_dynamic from the previous output
and the target_dtype and is_dynamic for the current Tensor we are looking at
let's use an example for conv node:
```
conv = convolution(input, weight, bias, ...)
```
let's say we have `input_node` object for argument `input`, and `conv_node` for `conv` node in the graph
(1) input arguments, e.g. `input`
the target_dtype/is_dynamic from previous output is the node that produces `input`, we get this from
input_node.meta["target_dtype_info"]["output_act_obs_or_fq"]
the target_dtype/is_dynamic for the current argument `input` comes from conv_node.meta["target_dtype_info"]["input_act_obs_or_fq"]
similarly for weight it comes from conv_node.meta["target"]["weightobs_or_fq"] etc.
(2) output for conv node
the target_dtype/is_dynamic from previous output will be the floating point output from the fp32 convolution operator, so it
is hardcoded to be (torch.float, False), however, technically we should get this from node.meta["val"], but since the
current code base is shared by fx graph mode quantization and pytorch 2.0 export quantization, we cannot do that, we can revisit
after we decide to deprecate fx graph mode quantization
the target_dtype/is_dynamic for the current output comes from conv_node.meta["target_dtype_info"]["output_act_obs_or_fq"]
there is one caveat here about dynamic quantization, that is explained in the comment, so I won't repeat here
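As a toy simplification of the unified decision (not the actual helper's signature or full logic):
```python
import torch

def _needs_obs_or_fq(prev_dtype, prev_is_dynamic, cur_dtype, cur_is_dynamic) -> bool:
    # dynamic quantization: the current tensor's settings drive the decision
    if cur_is_dynamic:
        return cur_dtype != torch.float
    # static quantization: an observer/fake-quant is needed only when the dtype changes
    return prev_dtype != cur_dtype
```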
Note: also fixed some places in `_get_arg_target_dtype_as_input_to_node` and `_get_arg_target_is_dynamic_as_input_to_node` to make sure "not specified" == specifying a fp32 placeholder observer as well
Next: we can merge the two get target dtype and get is_dynamic function to reduce code duplication
Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps
python test/test_quantization.py TestQuantizeFxModels
python test/test_quantization.py TestQuantizePT2E
python test/test_quantization.py TestQuantizePT2EModels
Imported from OSS
Differential Revision: D45198323
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99767
Approved by: https://github.com/kimishpatel
Summary:
We were using the "percentiles" form of triton.testing.do_bench, which
returns a list like (20th, 50th, 80th) percentile timings; I don't think we
care about that much detail, so let's just use the mean. I also took the
opportunity to clean up the redundant setting of rep, warmup, and fast_flush.
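A rough usage sketch, assuming a Triton version where `do_bench` defaults to returning a single mean timing rather than percentiles:
```python
import torch
import triton.testing

def fn():
    a = torch.randn(1024, 1024, device="cuda")
    return a @ a

ms = triton.testing.do_bench(fn)  # one mean time in ms instead of (20th, 50th, 80th) percentiles
```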
Test Plan:
```
TORCHBENCH_ATOL=1e-3 TORCHBENCH_RTOL=1e-3 TORCHINDUCTOR_PERMUTE_FUSION=1 TORCHINDUCTOR_SHAPE_PADDING=1 buck2 run mode/opt mode/inplace pytorch/benchmark:run -- ads_dhen_5x --part over --bs 1024 -d cuda -t train --torchdynamo inductor
```
Reviewed By: jiawenliu64
Differential Revision: D45241751
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99917
Approved by: https://github.com/jiawenliu64
ROCm's version of triton does not currently support tl.device_assert.
This operator, among others, is effectively a no-op unless "debug" = True is passed to the triton.compile function.
Until we have full support for tl.device_assert, avoid enabling the debug flag in triton.compile, so we do not have to find every possible location where tl.device_assert may be used.
Fixes #99725
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99756
Approved by: https://github.com/lezcano
Like #99817, I found that a method is missing;
I'm not sure if it was intentionally removed. But the function is still called on the Python side, and it seems to be very simple to implement.
So I made the change on the Python side.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99818
Approved by: https://github.com/ezyang
Follow up for https://github.com/pytorch/pytorch/pull/96532. Including this in setup.py so the package will be available for CI.
Fsspec package size:
```
du -h /fsx/users/irisz/conda/envs/pytorch/lib/python3.9/site-packages/fsspec-2023.3.0-py3.9.egg
264K /fsx/users/irisz/conda/envs/pytorch/lib/python3.9/site-packages/fsspec-2023.3.0-py3.9.egg/fsspec/__pycache__
58K /fsx/users/irisz/conda/envs/pytorch/lib/python3.9/site-packages/fsspec-2023.3.0-py3.9.egg/fsspec/implementations/__pycache__
377K /fsx/users/irisz/conda/envs/pytorch/lib/python3.9/site-packages/fsspec-2023.3.0-py3.9.egg/fsspec/implementations
1017K /fsx/users/irisz/conda/envs/pytorch/lib/python3.9/site-packages/fsspec-2023.3.0-py3.9.egg/fsspec
96K /fsx/users/irisz/conda/envs/pytorch/lib/python3.9/site-packages/fsspec-2023.3.0-py3.9.egg/EGG-INFO
1.2M /fsx/users/irisz/conda/envs/pytorch/lib/python3.9/site-packages/fsspec-2023.3.0-py3.9.egg
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99768
Approved by: https://github.com/kit1980
Fix https://github.com/pytorch/pytorch/issues/99686. In eager mode, if the given split sizes do not meet the requirements, an error is reported, but Inductor can still run. I think we need to align Inductor's behavior with eager mode. After this PR the behavior will be as follows (a minimal repro sketch follows the traceback):
```
Traceback (most recent call last):
File "/home/xiaobing/pytorch-offical/torch/_dynamo/utils.py", line 1267, in run_node
return node.target(*args, **kwargs)
File "/home/xiaobing/pytorch-offical/torch/functional.py", line 189, in split
return tensor.split(split_size_or_sections, dim)
File "/home/xiaobing/pytorch-offical/torch/_tensor.py", line 804, in split
return torch._VF.split_with_sizes(self, split_size, dim)
File "/home/xiaobing/pytorch-offical/torch/utils/_stats.py", line 20, in wrapper
return fn(*args, **kwargs)
File "/home/xiaobing/pytorch-offical/torch/_subclasses/fake_tensor.py", line 1095, in __torch_dispatch__
return self.dispatch(func, types, args, kwargs)
File "/home/xiaobing/pytorch-offical/torch/_subclasses/fake_tensor.py", line 1259, in dispatch
return decomposition_table[func](*args, **kwargs)
File "/home/xiaobing/pytorch-offical/torch/_decomp/decompositions.py", line 1102, in split_with_sizes
raise ValueError(
ValueError: Split sizes don't add up to the tensor's size in the given dimension
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/xiaobing/pytorch-offical/torch/_dynamo/utils.py", line 1215, in get_fake_value
return wrap_fake_exception(
File "/home/xiaobing/pytorch-offical/torch/_dynamo/utils.py", line 835, in wrap_fake_exception
return fn()
File "/home/xiaobing/pytorch-offical/torch/_dynamo/utils.py", line 1216, in <lambda>
lambda: run_node(tx.output, node, args, kwargs, nnmodule)
File "/home/xiaobing/pytorch-offical/torch/_dynamo/utils.py", line 1279, in run_node
raise RuntimeError(
RuntimeError: Failed running call_function <function split at 0x7f45b8402ee0>(*(FakeTensor(..., size=(1, 5)), [2, 1, 1]), **{'dim': 1}):
Split sizes don't add up to the tensor's size in the given dimension
(scroll up for backtrace)
The above exception was the direct cause of the following exception:
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99702
Approved by: https://github.com/jgong5, https://github.com/lezcano, https://github.com/jansel
The default option of `named_parameters` and `named_modules` is to remove the duplicated parameters and modules. However, in FSDP, we need to know what parameters are shared. As a result, setting `remove_duplicate` to False is required in FSDP. Without setting `remove_duplicate` to False, FSDP won't be able to discover shared weights in some cases (e.g., the shared weights are in the same module or there are shared modules).
The previous PR was reverted because some modules overwrite the signature of `named_parameters()`. This new PR adds a workaround for that case.
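A small illustration of why `remove_duplicate=False` matters for shared weights:
```python
import torch.nn as nn

class Tied(nn.Module):
    def __init__(self):
        super().__init__()
        self.a = nn.Linear(4, 4, bias=False)
        self.b = nn.Linear(4, 4, bias=False)
        self.b.weight = self.a.weight  # shared parameter

m = Tied()
print(len(list(m.named_parameters())))                        # 1: the duplicate is dropped
print(len(list(m.named_parameters(remove_duplicate=False))))  # 2: the sharing is visible
```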
Differential Revision: [D45065973](https://our.internmc.facebook.com/intern/diff/D45065973/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99448
Approved by: https://github.com/zhaojuanmao
This should fix https://github.com/pytorch/pytorch/issues/99011.
With `NCCL_DESYNC_DEBUG=0`, we can run 100 iterations of
```
CUDA_LAUNCH_BLOCKING=1 NCCL_DESYNC_DEBUG=1 CUDA_VISIBLE_DEVICES=0,7 numactl -C 2 python test/distributed/fsdp/test_fsdp_core.py -v -k test_transformer_no_grad --repeat 100 2>&1 | tee out
```
without erroring, whereas with `NCCL_DESYNC_DEBUG=1`, we can repro the error with high failure rate (usually <10 iterations).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99916
Approved by: https://github.com/zhaojuanmao
This PR improves the list/tuple handling by merging the logic into
`wrap_with_proxy` directly, and calling set_meta when we find the current
proxy is an fx.Proxy. This also solves the problem that, even though `fused_adam`
has `val`, some of the corresponding `getitem` calls that follow `fused_adam` don't have `val`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99897
Approved by: https://github.com/ezyang
When creating a torch.device object, like `x=torch.device("foo")`, the device index is None.
So in this scenario, we need to get the current device index again.
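For illustration (using 'cuda' here; the same applies to a custom backend's device type):
```python
import torch

d = torch.device("cuda")  # no index given
print(d.index)            # None -> the backend has to look up the current device index itself
```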
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99704
Approved by: https://github.com/albanD
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 9691a66</samp>
Update the `pt2-bug-report.yml` template to use `curl` instead of `wget`, `main` instead of `master`, and `python3` instead of `python`. These changes improve the portability and reliability of the bug report process.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99928
Approved by: https://github.com/kit1980, https://github.com/msaroufim
### Description
The PR aims at reducing CPU overhead of context manager style coalescing.
By "context manager style coalescing", we mean:
Sync style:
```
with _coalescing_manager():
for i in range(num_coll):
dist.all_reduce(tensors[i])
```
Async style:
```
with _coalescing_manager(async_ops=True) as cm:
for i in range(num_coll):
dist.all_reduce(tensors[i])
cm.wait()
```
In the previous implementation, each collective in the `num_coll` loop actually calls into the C++ backend, accumulating pybind overhead.
In the new implementation, we capture the collectives at the Python level, and only fire towards C++ at the exit of the coalescing manager.
### Tests
In current PR, the "fast path" only applies to all-reduce.
- Flattened 512M: 16.38 ms, including CPU time 131.21 us
- Old _coalescing_manager 64 x 8M: 22.19 ms, including CPU time 2865 us
- New _coalescing_manager 64 x 8M: 16.93 ms, including CPU time 635 us
Hence a 4x reduction in CPU overhead (dependent on `num_coll`).
Cc @mrshenli @kumpera @wanchaol @fegin
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98793
Approved by: https://github.com/kumpera
macOS 10.9 (Mavericks) was released a decade ago; update it to Big Sur, which was released in 2020. But keep the platform name as 10_9, as `pip` treats the platform as the one CPython was built on, not the one it runs on. Delete the duplicate `compile_x86_64` function from `macos_build.sh` and specify the platform name there.
Should fix MacOS x86 periodic build failures.
<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at ee4d5a8</samp>
> _`macosx_10_9` wheel_
> _Aligns with PyTorch support_
> _Winter of updates_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99857
Approved by: https://github.com/huydhn, https://github.com/atalman
In this stack of PRs we are adding caching of output tensors for cudagraph trees after we've done the initial recording. On initial recording we do not cache tensor outputs because this prevents memory from being reclaimed. On subsequent executions we do cache them to avoid overhead. However, because there is an extra reference around, this caused divergent recording & execution behavior in both autocast caching and autograd gradient stealing. Divergent recording & execution would keep on re-recording and eventually stabilize, but it's not what you want to see happen.
This pr makes the autocast cache and buffer stealing aware of the cudagraph static output tensors.
I will add this to the other cudagraph impl in another pr.
Not sure if this should be in autograd or in autocast since it affects both.. Or somewhere else
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99368
Approved by: https://github.com/albanD, https://github.com/ezyang
Summary: This adds support for CallMethod patterns in pattern_matcher. It also extends the split_cat transforms to normalize tensor.split()-style nodes.
Test Plan: Unit tests (fb + OSS)
Differential Revision: D45195548
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99782
Approved by: https://github.com/jansel
`new_subgroups` allows for the easy creation of sub-communication groups, but it currently requires CUDA availability. For communications that do not rely on CUDA, such as the CPU-based gloo backend or custom communication backends, I still hope to be able to use it, for example with CPU-based gloo (the same applies when using a custom backend):
```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
def gloo_process(rank_id, world_size, group_size, mp_lock):
assert not torch.cuda.is_available()
def lock_print(*args, **kwargs):
with mp_lock:
print(*args, **kwargs, flush=True)
os.environ['MASTER_ADDR'] = '127.0.0.1'
os.environ['MASTER_PORT'] = '29500'
dist.init_process_group('gloo', rank=rank_id, world_size=world_size)
subgroup, _ = dist.new_subgroups(group_size)
subgroup_ranks = list(range(subgroup.rank() * group_size, (subgroup.rank() + 1) * group_size))
lock_print(f"Rank {rank_id} initialized in subgroup_{subgroup.rank()}: {subgroup_ranks}")
tensor = torch.Tensor([rank_id + 1])
subgroup.broadcast(tensor, root=0)
lock_print(f"After broadcast, rank {rank_id} in subgroup_{subgroup.rank()}:{subgroup_ranks} got {tensor}")
if __name__ == "__main__":
world_size = 4
group_size = 2
processes = []
mp.set_start_method("spawn")
mp_lock = mp.Lock()
for rank in range(world_size):
p = mp.Process(target=gloo_process, args=(rank, world_size, group_size, mp_lock))
p.start()
processes.append(p)
for p in processes:
p.join()
```
```bash
Rank 0 assigned to subgroup_0: [0, 1]
Rank 1 assigned to subgroup_1: [2, 3]
Rank 2 assigned to subgroup_0: [0, 1]
Rank 3 assigned to subgroup_1: [2, 3]
After broadcast, rank 2 in subgroup_0:[0, 1] got tensor([3.])
After broadcast, rank 3 in subgroup_1:[2, 3] got tensor([3.])
After broadcast, rank 1 in subgroup_1:[2, 3] got tensor([1.])
After broadcast, rank 0 in subgroup_0:[0, 1] got tensor([1.])
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99706
Approved by: https://github.com/kumpera
### This change
- Implements the ruff linter in pytorch lintrunner. It is adapted from https://github.com/justinchuby/lintrunner-adapters/blob/main/lintrunner_adapters/adapters/ruff_linter.py. It does **both linting and fixing**. 🔧
- Migrated all flake8 configs to the ruff config and enabled it for the repo. ✅
- **`ruff` lints the whole repo in under 2s** 🤯
Fixes https://github.com/pytorch/pytorch/issues/94737 Replaces #99280
@huydhn @Skylion007
<!--
copilot:all
-->
### <samp>🤖 Generated by Copilot at 6b982dd</samp>
### Summary
🧹🛠️🎨
<!--
1. 🧹 This emoji represents cleaning or tidying up, which is what `ruff` does by formatting and linting the code. It also suggests improving the code quality and removing unnecessary or redundant code.
2. 🛠️ This emoji represents tools or fixing, which is what `ruff` is as a code formatter and linter. It also suggests enhancing the code functionality and performance, and resolving potential issues or bugs.
3. 🎨 This emoji represents art or creativity, which is what `ruff` allows by providing a consistent and configurable style for the code. It also suggests adding some flair or personality to the code, and making it more readable and enjoyable.
-->
Add `[tool.ruff]` section to `pyproject.toml` to configure `ruff` code formatter and linter. This change aims to improve code quality and consistency with a single tool.
> _`ruff` cleans the code_
> _like a spring breeze in the fields_
> _`pyproject.toml`_
### Walkthrough
* Configure `ruff` code formatter and linter for the whole project ([link](https://github.com/pytorch/pytorch/pull/99785/files?diff=unified&w=0#diff-50c86b7ed8ac2cf95bd48334961bf0530cdc77b5a56f852c5c61b89d735fd711R22-R79))
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99785
Approved by: https://github.com/malfet, https://github.com/Skylion007
Use bindless Argument Buffers (unbounded arrays) for advanced indexing kernels - this allows caching of the PSOs since we no longer have to query the main Metal function for the AB size (this is now filled in directly on the CPU).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99855
Approved by: https://github.com/kulinseth
This lowers `aten.prod` using the new `tl.reduce` functionality in triton. I
also introduce `TritonKernel.helper_functions`, which allows code to be defined
outside of the kernel body so that we can define the `_prod_accumulate` helper
function.
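A sketch of what such a helper can look like (hypothetical body; the real one is emitted into `TritonKernel.helper_functions`):
```python
import triton
import triton.language as tl

@triton.jit
def _prod_accumulate(a, b):
    # combine function passed to tl.reduce: multiply partial products together
    return a * b

# inside the generated kernel the reduction is then emitted roughly as:
#   result = tl.reduce(tmp, axis, _prod_accumulate)
```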
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99484
Approved by: https://github.com/ngimel
Add support for kernel coalescing to native kernels.
This change reuses the same compute command encoder across successive metal kernel dispatches. The coalescing will stop when a graph op is encountered.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99810
Approved by: https://github.com/kulinseth
Summary:
Support the special case where the data size can be 0 for SegmentReduce.
Example code below:
```
x = torch.ones((0, 6)).cuda()
lengths = torch.tensor([0, 0]).cuda()
torch.segment_reduce(x, "sum", lengths=lengths, unsafe=False, initial=0)
```
Previously, this raised the error message: "Expected data.numel() > 0 to be true, but got false."
Now it is expected to return 0.
Test Plan: contbuild & OSS CI
Differential Revision: D45133827
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99733
Approved by: https://github.com/ngimel
Testing if the minor change breaks other test cases.
For the added test case, TorchDynamo causes a graph break on `torch.ops.foo.custom` but then starts running again on the recursively invoked frame - `foo_cpu` on L48 in the test file. This raises an assertion like this:
~~~
Traceback (most recent call last):
File "/scratch/anijain/work/pytorch/test/dynamo/test_decorators.py", line 65, in test_disallow_in_graph_for_custom_op
res = opt_fn(x)
File "/scratch/anijain/work/pytorch/torch/_dynamo/eval_frame.py", line 252, in _fn
return fn(*args, **kwargs)
File "/scratch/anijain/work/pytorch/test/dynamo/test_decorators.py", line 56, in fn
b = torch.ops.foo.custom(a)
File "/scratch/anijain/work/pytorch/torch/_ops.py", line 646, in __call__
return self._op(*args, **kwargs or {})
File "/scratch/anijain/work/pytorch/torch/_dynamo/eval_frame.py", line 401, in catch_errors
return callback(frame, cache_size, hooks, frame_state)
File "/scratch/anijain/work/pytorch/torch/_dynamo/convert_frame.py", line 495, in _convert_frame
result = inner_convert(frame, cache_size, hooks, frame_state)
File "/scratch/anijain/work/pytorch/torch/_dynamo/convert_frame.py", line 122, in _fn
return fn(*args, **kwargs)
File "/scratch/anijain/work/pytorch/torch/_dynamo/convert_frame.py", line 331, in _convert_frame_assert
return _compile(
File "/scratch/anijain/work/pytorch/torch/_dynamo/utils.py", line 169, in time_wrapper
r = func(*args, **kwargs)
File "/scratch/anijain/work/pytorch/torch/_dynamo/convert_frame.py", line 401, in _compile
out_code = transform_code_object(code, transform)
File "/scratch/anijain/work/pytorch/torch/_dynamo/bytecode_transformation.py", line 1000, in transform_code_object
transformations(instructions, code_options)
File "/scratch/anijain/work/pytorch/torch/_dynamo/convert_frame.py", line 371, in transform
tracer = InstructionTranslator(
File "/scratch/anijain/work/pytorch/torch/_dynamo/symbolic_convert.py", line 1890, in __init__
self.symbolic_locals = collections.OrderedDict(
File "/scratch/anijain/work/pytorch/torch/_dynamo/symbolic_convert.py", line 1893, in <genexpr>
VariableBuilder(
File "/scratch/anijain/work/pytorch/torch/_dynamo/variables/builder.py", line 165, in __call__
return self._wrap(value).clone(**self.options())
File "/scratch/anijain/work/pytorch/torch/_dynamo/variables/builder.py", line 290, in _wrap
return type_dispatch(self, value)
File "/scratch/anijain/work/pytorch/torch/_dynamo/variables/builder.py", line 776, in wrap_tensor
tensor_variable = wrap_fx_proxy(
File "/scratch/anijain/work/pytorch/torch/_dynamo/variables/builder.py", line 923, in wrap_fx_proxy
return wrap_fx_proxy_cls(
File "/scratch/anijain/work/pytorch/torch/_dynamo/variables/builder.py", line 983, in wrap_fx_proxy_cls
example_value = wrap_to_fake_tensor_and_record(
File "/scratch/anijain/work/pytorch/torch/_dynamo/variables/builder.py", line 1213, in wrap_to_fake_tensor_and_record
fake_e = wrap_fake_exception(
File "/scratch/anijain/work/pytorch/torch/_dynamo/utils.py", line 835, in wrap_fake_exception
return fn()
File "/scratch/anijain/work/pytorch/torch/_dynamo/variables/builder.py", line 1214, in <lambda>
lambda: tx.fake_mode.from_tensor(
File "/scratch/anijain/work/pytorch/torch/_subclasses/fake_tensor.py", line 1434, in from_tensor
return self.fake_tensor_converter(
File "/scratch/anijain/work/pytorch/torch/_subclasses/fake_tensor.py", line 329, in __call__
return self.from_real_tensor(
File "/scratch/anijain/work/pytorch/torch/_subclasses/fake_tensor.py", line 283, in from_real_tensor
out = self.meta_converter(
File "/scratch/anijain/work/pytorch/torch/_subclasses/meta_utils.py", line 531, in __call__
r = self.meta_tensor(
File "/scratch/anijain/work/pytorch/torch/_subclasses/meta_utils.py", line 184, in meta_tensor
assert not torch._C._dispatch_tls_local_exclude_set().has(
AssertionError:
~~~
It seems `_dynamo.disable` is the right option for custom ops added by `torch.library`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99600
Approved by: https://github.com/jansel
This PR adds list handling logic to the new DataParallel expansion and
adds foreach optimizer tests, currently testing SGD optimizers
in foreach mode, for both replicate and fully shard.
Next step:
Add fused optim tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99373
Approved by: https://github.com/mrshenli
This PR refactors the current StrategyList. It introduces a
StrategyType, which is the base class of strategies, and it has
two sub-strategies:
1. Refactor the previous StrategyList into OpStrategy
2. Add TupleStrategy, a new strategy added to deal with tuple cases where
an op could return multiple different OpStrategy objects.
This helps support more complicated ops and unblocks compile-mode
FSDP.
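A minimal sketch of the resulting hierarchy (simplified; field names are illustrative):
```python
from dataclasses import dataclass, field
from typing import List

class StrategyType:
    """Base class for op sharding strategies."""

@dataclass
class OpStrategy(StrategyType):
    # candidate placement strategies for a single op output (was StrategyList)
    strategies: List[object] = field(default_factory=list)

@dataclass
class TupleStrategy(StrategyType):
    # one child strategy per element of a tuple-returning op
    children: List[StrategyType] = field(default_factory=list)
```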
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99435
Approved by: https://github.com/mrshenli
Summary: There are some customized functions that we would also like to keep during the eliminate-dead-code pass. Add a function to help us do that.
Test Plan: Added a unit test
Differential Revision: D44273630
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97288
Approved by: https://github.com/houseroad
Summary:
Support the file extension .html, which will include a PNG image of the plot embedded into an HTML file.
This allows users to avoid processing the timeline manually in their own frontend UI.
Test Plan:
CI Tests
Ran on resnet50 model and generated this html file w/ plot:
See attached html file: {F954232276}
Screenshot: {F954232469}
Differential Revision: D45152735
Pulled By: aaronenyeshi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99751
Approved by: https://github.com/davidberard98
This PR introduces compile-mode Data Parallel (FSDP/DDP) using DTensor sharding.
Along with the algorithm, it also introduces a new DataParallelMode so that the `compile` API can take it
and apply data parallelism. This PR tries to preserve the DTensorExpand
approach first to avoid BC breakage; we shall discuss steps to remove
DTensorExpand.
The data parallel mode uses heuristics to determine node types in the
graphs and assign the corresponding sharding. The detailed algorithm is
described in the design doc.
The benefits of this approach:
- Model parameters and optimizer states are all DTensors after `spmd.compile`, which is necessary for FSDP, and also makes checkpointing much easier
- As model parameters/optim states are sharded in a per-parameter fashion, it would be able to compose with sophisticated second-order optimizers (e.g. Shampoo) in an easier way.
- We leverage the model parameter/grad information to derive the data parallel pattern. In this way we don't need to worry about DTensor op coverage anymore! Data parallel is just a special case of DTensor operation.
- Using dtensor_expand might work for DDP but isn't going to work for FSDP, as DTensor might choose to all-gather activations, which might violate the native FSDP algorithm.
- The approach is general enough to support both DDP/FSDP and a mixed mode
Follow ups:
- Add the "default" data parallel mode which supports mixing of
replicate/fully shard
- Test more e2e models with more different types of optimizers, etc
- migrate the existing stack from the DTensorExpand mode
- build optimizations on top of this prototype
Differential Revision: [D45174400](https://our.internmc.facebook.com/intern/diff/D45174400)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99062
Approved by: https://github.com/mrshenli
This reverts commit b9da79d2800c2ca00b57bc3ac86b43e01be174b6.
Reverted https://github.com/pytorch/pytorch/pull/98706 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but a bunch of inductor tests are failing after this commit, so reverting the PR just to be sure
This depends on [pytest-cpp](https://github.com/pytest-dev/pytest-cpp) to discover and run C++ tests with pytest. C++ tests are built under `${WORKSPACE}/build/bin` directory and copied to the test job under the same path.
* To expose them to `run_test`, I chose to use the mock path prefix `cpp`, for example `build/bin/c10_Array_test` would be named `cpp/c10_Array_test` and `python test/run_test.py --cpp -i cpp/c10_Array_test` would run the test in the same way as other Python tests. I could copy them from `build/bin` to `test/cpp`, but they would be mixed with the source code and CMake files, so this looks easier
* Some executables under `build/bin` are not C++ tests and are excluded, for example `build/bin/torch_shm_manager`
* C++ tests need to run with pytest directly, as the python command doesn't understand them
* The change is gated by the new `--cpp` argument to `run_test.py`, for example `python test/run_test.py --cpp` will run all available C++ tests
* The tests can be run in parallel
* Failing tests can be retried with `--reruns=2` and `--sw`
```
============================= test session starts ==============================
platform darwin -- Python 3.9.15, pytest-7.2.0, pluggy-1.0.0 -- /Users/huydo/miniconda3/envs/py3.9/bin/python3
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase('/Users/huydo/Storage/mine/pytorch/test/.hypothesis/examples')
rootdir: /Users/huydo/Storage/mine/pytorch, configfile: pytest.ini
plugins: xdoctest-1.1.0, cpp-2.3.0, rerunfailures-10.3, shard-0.1.2, flakefinder-1.1.0, hypothesis-6.56.4, xdist-3.0.2, repeat-0.9.1
collecting ... collected 3 items / 2 deselected / 1 selected
Running 1 items in this shard: build/bin/scalar_tensor_test::TestScalarTensor.TestScalarTensorMPS
stepwise: skipping 2 already passed items.
../build/bin/scalar_tensor_test::TestScalarTensor::TestScalarTensorMPS RERUN [100%]
../build/bin/scalar_tensor_test::TestScalarTensor::TestScalarTensorMPS RERUN [100%]
../build/bin/scalar_tensor_test::TestScalarTensor::TestScalarTensorMPS FAILED [100%]
```
* `--import-slow-tests` and `--import-disabled-tests` won't work for now; it's OK to leave that as a future task.
I also added `pytest-cpp==2.3.0` to the Linux Docker, macOS, and Windows images.
### Testing
Build PyTorch and run `python test/run_test.py --cpp` on my laptop. The CI change will come later in a separate PR. Also, running `python test/run_test.py --help` now shows all C++ tests discovered under `build/bin`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99559
Approved by: https://github.com/clee2000
## What's in this PR
DeviceMesh's __init__ function now requires all calling ranks to pass the same `mesh` argument.
## Why
We want to enforce an SPMD style of programming with DTensor. Before this PR, the 2-D parallel API (e.g. _create_1d_device_mesh) defined a different DeviceMesh on different ranks. After this PR, it defines each sub-mesh and simply performs communications on the one it is associated with.
Differential Revision: [D45165511](https://our.internmc.facebook.com/intern/diff/D45165511)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99094
Approved by: https://github.com/wanchaol
Summary:
Currently torch.fx supports Modules with namedtuple/dataclass inputs and namedtuple returns, but does not allow Module.forward to return a dataclass. Running `test_trace_return_dataclass` without this change produces the following error:
NotImplementedError: argument of type: <class 'test_fx.TestFX.test_trace_return_dataclass.<locals>.MyOutput'>
File "test_trace_return_dataclass
traced_graph = symbolic_trace(module).graph
File "test/__fx__/fx#link-tree/torch/fx/_symbolic_trace.py", line 1114, in symbolic_trace
graph = tracer.trace(root, concrete_args)
File "test/__fx__/fx#link-tree/torch/fx/_symbolic_trace.py", line 783, in trace
(self.create_arg(fn(*args)),),
File "test/__fx__/fx#link-tree/torch/fx/_symbolic_trace.py", line 378, in create_arg
return super().create_arg(a)
File "test/__fx__/fx#link-tree/torch/fx/proxy.py", line 269, in create_arg
raise NotImplementedError(f"argument of type: {type(a)}")
This diff handles the dataclass type.
Test Plan:
buck test @//mode/opt @//mode/inplace //caffe2/test:fx -- test_trace_
graph():
%d : torch.Tensor [#users=1] = placeholder[target=d]
%my_output : [#users=1] = call_function[target=test_fx.MyOutput](args = (), kwargs = {foo: %d, bar: %d})
return my_output
Differential Revision: D44916519
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99576
Approved by: https://github.com/suo
High level approach:
1. I generated a bunch of data comparing FlashAttention and Cutlass implementations (https://pastebin.com/pe0j3YeK)
2. I trained a decision tree using standard train/val split methodology and hyperparameter sweeps (https://pastebin.com/fjYX1HjR).
2a. I did a bunch of feature augmentation to capture interactions between features.
The heuristic I ended up with is:
```
use_flash = seq_len / (num_heads * batch_size) > 6
```
TL;DR: On my dataset, where FlashAttention and Cutlass differ by more than 10%, the existing heuristic achieves 69% accuracy. My new heuristic achieves 94% accuracy.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99644
Approved by: https://github.com/ngimel, https://github.com/drisspg
Fixes #99545
There is currently no topological constraint dictating that FSDP instances own ``FlatParamHandle``s directly. If all parameters are managed by descendant FSDP instances, leaving an FSDP instance with no direct ``state._handles``, then the ``should_cast_forward_inputs`` decisions below, in both ``_root_pre_forward()`` and ``_pre_forward()`` respectively, can be incorrect [^1].
For [``_root_pre_forward()``](436edc5ac3/torch/distributed/fsdp/_runtime_utils.py (L514)):
436edc5ac3/torch/distributed/fsdp/_runtime_utils.py (L602-L604)
For [``_pre_forward``](436edc5ac3/torch/distributed/fsdp/_runtime_utils.py (L384)):
436edc5ac3/torch/distributed/fsdp/_runtime_utils.py (L420-L422)
See the [related issue](https://github.com/pytorch/pytorch/issues/99545) for reproduction.
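A hedged sketch of the kind of wrapping that produces a root FSDP instance with no direct handles (module names here are illustrative, and a default process group is assumed to have been initialized):
```python
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

mp = MixedPrecision(param_dtype=torch.float16, cast_forward_inputs=True)

class Outer(nn.Module):
    def __init__(self):
        super().__init__()
        # every parameter is wrapped by a child FSDP instance
        self.a = FSDP(nn.Linear(8, 8), mixed_precision=mp)
        self.b = FSDP(nn.Linear(8, 8), mixed_precision=mp)

    def forward(self, x):
        return self.b(self.a(x))

# the root instance below owns no FlatParamHandle of its own, so
# state._handles is empty and an all(...) over it is vacuously True
root = FSDP(Outer(), mixed_precision=mp)
```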
### Remediation
In this PR, I amend the two decision statements referenced above (in both `_root_pre_forward()` and `_pre_forward()`) to account for FSDP instances without direct handles:
```python
should_cast_forward_inputs = len(state._handles) > 0 and all(
not handle._force_full_precision for handle in state._handles
)
```
If one configures ``MixedPrecision`` in the example above with ``cast_forward_inputs=True`` and the ``should_cast_forward_inputs`` adjustment above, FSDP returns to the expected behavior and produces no error.
Though the check is the same in both ``_root_pre_forward()`` and ``_pre_forward()`` and hence could be refactored into a separate function, I figured it may make sense to retain separate statements to preserve the ability for root-specific behavior in the future. Whichever approach the team prefers I can update this PR with.
### Implementation considerations and questions:
1. Rather than write a test that would arguably have a poor utility/resource usage profile, I have not added any tests associated with this PR. The new decision logic is exercised by all existing tests (which continue to pass after this PR of course) so I think the utility of new tests is fairly modest. Let me know if you think new tests should be added and I'm happy to do so.
2. As discussed above, the decision statement shared among ``_pre_forward()`` and ``_root_pre_forward()`` could be factored out into a separate function. Given the simplicity of the statement and to retain current flexibility for root-specific decisions it might not be worth the refactor so I haven't done it yet. Let me know if you'd like me to do so.
3. The note below could be updated to indicate the utility of setting ``cast_forward_inputs=True`` for the situations addressed with this PR but I haven't done so since I'm not sure it's worth complicating the current usage guidance. I'd be happy to add verbiage describing the use case if the team wants it.
cde35b4069/torch/distributed/fsdp/api.py (L175-L181)
Thanks again to the PyTorch distributed team for your immensely valuable contributions to the open-source ML community!
[^1]: Though one could keep the existing decision logic and impose a new topological constraint requiring all FSDP instances have direct `_handles`, I think retaining the current wrapping flexibility is both convenient and useful enough (e.g. programmatic wrapping of modules that may or may not already have all parameters handled by descendant FSDP instances) to update the decision logic as discussed here instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99546
Approved by: https://github.com/awgu
This PR adds calls to nvml during an OOM to find out the total memory
in use by the process and any other CUDA processes on the device.
This makes it easier to identify cases where non-PyTorch libraries have
allocated memory or another process (such as a data loader) has also
allocated memory on the device.
This also rewords the other parts of the error message to make the meaning
of the memory statistics more clear with this new information:
"""
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 138.00 MiB.
GPU 0 has a total capacty of 15.90 GiB of which 8.44 MiB is free.
Process 1246069 has 577.00 MiB memory in use. Including non-PyTorch memory,
this process has 15.32 GiB memory in use. Of the allocated memory
14.12 GiB is allocated by PyTorch, and 410.41 MiB is reserved
by PyTorch but unallocated. If reserved but unallocated memory is large
try setting max_split_size_mb to avoid fragmentation. See documentation
for Memory Management and PYTORCH_CUDA_ALLOC_CONF
"""
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99699
Approved by: https://github.com/ngimel
A theory is that something else on the runner removes the file like Windows Defender. The number one suspect is `com.apple.XProtect.daemon.scan` https://support.apple.com/guide/security/protecting-against-malware-sec469d47bd8/web
Spot checking on some runners:
* On 13.x (13.3.1 and 13.2.1), the daemon is now called `com.apple.XProtect.daemon.scan`
```
sh-3.2$ sudo launchctl list | grep -i protect
8048 -9 com.apple.XprotectFramework.PluginService
8047 -9 com.apple.XProtect.daemon.scan
```
* On 12.4, the daemon is called `com.apple.XprotectFramework`
```
sudo launchctl list | grep -i protect
- -9 com.apple.XprotectFramework.PluginService
- -9 com.apple.XprotectFramework.scan
```
Looking at the list of failures in https://hud.pytorch.org/failure/ModuleNotFoundError%3A%20No%20module%20named%20'sympy', I can confirm that the issue happens with both MacOS 12 and 13 as I can find examples on both.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99506
Approved by: https://github.com/malfet
This PR makes the `use_orig_params=True` case support rank0_only loading for the optim state_dict. The implementation differs from `use_orig_params=False`: the `use_orig_params=False` implementation first flattens the parameters on rank 0 and then broadcasts the states, while this implementation broadcasts the states while doing the flattening. This implementation is slower as it broadcasts the original parameters instead of the flattened ones; however, it is simpler. As loading usually happens once per training run, the performance difference can be ignored. In the next PR, we will consolidate the implementations in favor of the simpler one.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99624
Approved by: https://github.com/wz337
Fixes #99326
Support storage pin_memory and is_pinned for custom devices by calling the dispatched tensor operations.
@ezyang this PR is what we discussed in issue #99326; would you please take a moment to review it, thanks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99712
Approved by: https://github.com/ezyang
This PR caches the addr -> Frame information across calls to symbolize,
and also keeps the addr2line symbolizing processes around once requested.
This makes calls to symbolize frames that have been seen before nearly instant,
and makes looking up addresses in libraries that addr2line has already loaded faster.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99670
Approved by: https://github.com/ezyang
This PR introduces a ParallelMode interface to define how to do SPMD expansion and optimize the captured graph. This would be beneficial for different parallelisms to expand differently and apply different optimization passes.
DTensorExpandMode is added as the first parallel mode, implementing the existing dtensor_expand functionality.
Differential Revision: [D45174399](https://our.internmc.facebook.com/intern/diff/D45174399)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98452
Approved by: https://github.com/mrshenli
Summary:
Previously we had two places where we needed to decide whether or not to insert an observer or fake quantizer: (1) input arguments of a node, (2) output of a node, and we had separate code for each.
In this PR, the logic is unified in the `_needs_obs_or_fq` helper function, which takes the target_dtype and is_dynamic from the previous output, and the target_dtype and is_dynamic for the current Tensor we are looking at.
let's use an example for conv node:
```
conv = convolution(input, weight, bias, ...)
```
Let's say we have an `input_node` object for the argument `input`, and a `conv_node` for the `conv` node in the graph.
(1) Input arguments, e.g. `input`:
The target_dtype/is_dynamic from the previous output comes from the node that produces `input`; we get this from input_node.meta["target_dtype_info"]["output_act_obs_or_fq"].
The target_dtype/is_dynamic for the current argument `input` comes from conv_node.meta["target_dtype_info"]["input_act_obs_or_fq"]; similarly, for the weight it comes from conv_node.meta["target_dtype_info"]["weight_obs_or_fq"], etc.
(2) Output of the conv node:
The target_dtype/is_dynamic from the previous output is the floating point output of the fp32 convolution operator, so it is hardcoded to (torch.float, False). Technically we should get this from node.meta["val"], but since the current code base is shared by fx graph mode quantization and PyTorch 2.0 export quantization, we cannot do that; we can revisit this after we decide to deprecate fx graph mode quantization.
The target_dtype/is_dynamic for the current output comes from conv_node.meta["target_dtype_info"]["output_act_obs_or_fq"].
There is one caveat here about dynamic quantization that is explained in the code comment, so I won't repeat it here.
Note: also fixed some places in `_get_arg_target_dtype_as_input_to_node` and `_get_arg_target_is_dynamic_as_input_to_node` to make sure "not specified" == specifying a fp32 placeholder observer as well
Next: we can merge the two get-target-dtype and get-is-dynamic functions to reduce code duplication.
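A hedged sketch of the decision the unified helper makes (the names and the dynamic-quantization special case below are simplified assumptions, not the exact code):
```python
import torch

def needs_obs_or_fq(prev_dtype, prev_is_dynamic, cur_dtype, cur_is_dynamic):
    # an observer/fake-quantize is needed when what the previous output
    # produces differs from what the current tensor wants
    if cur_is_dynamic:
        # dynamic quantization inserts choose_qparams/quant/dequant at
        # runtime, so only a non-float target dtype matters here (simplified)
        return cur_dtype != torch.float
    return (prev_dtype, prev_is_dynamic) != (cur_dtype, cur_is_dynamic)
```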
Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps
python test/test_quantization.py TestQuantizeFxModels
python test/test_quantization.py TestQuantizePT2E
python test/test_quantization.py TestQuantizePT2EModels
Differential Revision: [D45167585](https://our.internmc.facebook.com/intern/diff/D45167585)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99220
Approved by: https://github.com/kimishpatel
Fixes #99148, raising an error if output_ratio's size is greater than 2.
Justification for changes:
If an output size is not specified but an output ratio is, we call fractional_max_pool2d_with_indices. We then generate the value of output_size based on the first two integers of the output_ratio (line ~480 of torch.nn.functional.py).
Thus, we should raise a ValueError in the case that the user passes an output_ratio (instead of an output_size) and the number of elements in output_ratio exceeds two. We must raise the error before calling torch._C._nn.fractional_max_pool2d, as the value of output_size passed into torch._C._nn.fractional_max_pool2d is guaranteed to be of size 2 (the existing code generates it from the first two indices of the passed-in ratio).
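A minimal sketch of the added validation, assuming it sits next to the existing output_size computation in torch/nn/functional.py (the helper name here is made up for illustration):
```python
def _check_output_ratio(output_size, output_ratio):
    if output_size is None and output_ratio is not None and len(output_ratio) > 2:
        raise ValueError(
            "fractional_max_pool2d requires output_ratio to be a single float "
            f"or a pair of floats, but got {len(output_ratio)} elements"
        )
```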
I would be happy to iterate on this if there are any issues.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99507
Approved by: https://github.com/mikaylagawarecki
A quick, trial fix for #99677.
My guess is that when the code instantiates an `AutoNcclGroup` object, it comes with an uninitialized, random value for the member `comm_nonblocking_`. Then `if (comm_nonblocking_)` evaluates to true, and `NCCL_CHECK_TIMEOUT` is triggered.
This change is safe (and needed) anyway, whether or not it indeed fixes #99677.
Cc @eqy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99679
Approved by: https://github.com/eqy, https://github.com/awgu
Summary:
Followup diffs to integrate this op into the other parts of the delegate workflow.
The unittest results in the following graph:
```
graph():
%x_1 : [#users=1] = placeholder[target=x_1]
%y_1 : [#users=1] = placeholder[target=y_1]
%lowered_module_0 : [#users=1] = get_attr[target=lowered_module_0]
%call_delegate : [#users=1] = call_function[target=torch.ops.call_delegate](args = (%lowered_module_0, forward, %x_1, %y_1), kwargs = {})
return call_delegate
```
Test Plan: buck2 run //executorch/exir/tests:delegate -- -r "test_call_delegate"
Differential Revision: D42329287
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92562
Approved by: https://github.com/voznesenskym
TL;DR: I did a quick study of register spills in max-autotune and coordinate descent tuning. The conclusion is that for pointwise/reduction kernels, register spilling is rare in inductor (which means the configs we consider are relatively reasonable), but it does happen sometimes. To be honest, this PR isn't going to reduce compilation time for max-autotune/coordinate descent tuning much, because register spilling is very rare. But this PR only contains two lines of significant code change, so I think it's still worth merging considering the ROI and code complexity.
# Register Spill in Max-Autotuner
I ran command
```
rm -rf /tmp/torchinductor_shunting_tmp && time TORCHINDUCTOR_MAX_AUTOTUNE_POINTWISE=1 TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_shunting_tmp python -u benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --dashboard --only ${MODEL} --disable-cudagraphs --training 2>&1 | tee /tmp/mylog
```
and then analyze the log.
$ cat /tmp/mylog | grep 'nspill' | wc -l
will show the total number of triton.Config's we benchmark;
$ cat /tmp/mylog | grep 'nspill' | grep -v 'nspill 0'
will show the number of triton.Config's that spill registers.
Checked 5 models
- hf_Bert 0 spills
- resnet50: 2 out of 199 triton.Config's spill. For the 2 configs that spill, they are suboptimal according to the log: https://gist.github.com/shunting314/7ea30a9dafad7156919a99df5feba0ee
- timm_vision_transformer: 2/77 spills. The spilled configs are again sub-optimal: https://gist.github.com/shunting314/a48cbcfb14a07c0b84555e2cf7154852
- BERT_pytorch: 0/123 spills
- timm_resnest 0/255 spills
# Register Spill in Coordinate Descent Tuner
I ran command
```
rm -rf /tmp/torchinductor_shunting_tmp && time TORCHINDUCTOR_MAX_AUTOTUNE_POINTWISE=1 TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_shunting_tmp TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 TORCHINDUCTOR_PERSISTENT_REDUCTIONS=0 python -u benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --dashboard --only ${MODEL} --disable-cudagraphs --training 2>&1 | tee /tmp/mylog
```
and then analyze the log.
$ cat /tmp/mylog | grep COORDESC | wc -l
shows the total number of configs considered by the coordinate descent tuner
$ cat /tmp/mylog | grep COORDESC | grep -v 'nspill 0'
shows the ones that spill.
Checked 3 models
- hf_Bert (log https://gist.github.com/shunting314/bd943887e77609c7c8b323fe3f554c85 )
0/525 spills
- resnet50: 0/783 spills
- timm_vision_transformer: 2/380 (log https://gist.github.com/shunting314/6231f06c1398e0cddb2a96bf52389c78 )
the 2 spilled ones are sub-optimal
# Ignore Spilled Config
With this PR, I ran the tests for timm_vision_transformer and can see that all 4 spilled configs (2 for max-autotune and 2 for the coordinate descent tuner, according to the study above) are skipped for benchmarking:
```
[2023-04-18 00:03:37,291] torch._inductor.triton_heuristics: [DEBUG] Skip config XBLOCK: 16, YBLOCK: 512, num_warps: 8, num_stages: 1 because of register spilling: 6
[2023-04-18 00:04:50,523] torch._inductor.triton_heuristics: [DEBUG] Skip config XBLOCK: 64, RBLOCK: 64, num_warps: 8, num_stages: 1 because of register spilling: 626
[2023-04-18 00:04:50,523] torch._inductor.triton_heuristics: [DEBUG] Skip config XBLOCK: 8, RBLOCK: 512, num_warps: 8, num_stages: 1 because of register spilling: 778
[2023-04-18 00:05:47,170] torch._inductor.triton_heuristics: [DEBUG] Skip config XBLOCK: 1, num_warps: 2, num_stages: 1 because of register spilling: 4
```
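A minimal sketch (not the actual triton_heuristics code) of the filtering idea, assuming each compiled launcher reports a spill count:
```python
def prune_spilling_configs(configs_and_launchers, spill_threshold=16):
    # drop candidate configs whose compiled kernel spills registers; they are
    # almost always sub-optimal, so skipping their benchmark run saves time
    kept = []
    for cfg, launcher in configs_and_launchers:
        n_spills = getattr(launcher, "n_spills", 0)
        if n_spills > spill_threshold:
            print(f"Skip config {cfg} because of register spilling: {n_spills}")
            continue
        kept.append((cfg, launcher))
    return kept
```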
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99385
Approved by: https://github.com/jansel
Fixes #99446
Remove the warning, as that annoyed end-users who don't know what to do about it.
Instead, try to hold the line by preventing any decomp from being added without making
the corresponding change to inductor's fallbacks.
Note: we probably still need to better document how to update inductor's decomps,
for now it's pretty much "go ask the inductor team for advice"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99473
Approved by: https://github.com/ezyang, https://github.com/ngimel, https://github.com/jansel
**Summary**
Since the current quantization flow has not decomposed quant/dequant into prim ops, in this PR:
- We enable the quant/dequant decomposition as a lowering inside inductor (see the sketch after this list).
- For the `decomposed.quant/dequant.tensor` overloads, there are loads of the scalar tensors for `zero_point` and `scale`; we need to enable vectorized code generation for these op overloads.
- Minor change: add `is_load_uint8_as_float` and `is_store_float_as_uint8` with default value `False` to `OptimizationContext`.
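For reference, roughly the math the decomposed quant/dequant ops implement for the uint8 case (a sketch, not the exact inductor lowering):
```python
import torch

def quantize_per_tensor(x, scale, zero_point, quant_min=0, quant_max=255):
    # round, shift by zero point, clamp to the quantized range, narrow to uint8
    q = torch.clamp(torch.round(x / scale) + zero_point, quant_min, quant_max)
    return q.to(torch.uint8)

def dequantize_per_tensor(q, scale, zero_point):
    # widen back to float, undo the zero-point shift, rescale
    return (q.to(torch.float32) - zero_point) * scale
```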
**TestPlan**
```
cd test/inductor && python -m pytest test_cpu_repro.py -k test_dequant_quant_lowering
```
Co-authored with @Xia-Weiwen.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99131
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/jansel
Coordinating with arogozhnikov from einops team, allowing specific operators in the dynamo graph avoids dynamo tracing problems provided the operators are screened for safety - they must not bake in unintended constants or take data-dependent control flow paths.
Fixes #99031
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99631
Approved by: https://github.com/jansel
In edge cases in CI, SLOW_TESTS_FILE is defined but does not point to an existing file.
Guessing this is due to a test case that opens a subprocess and changes cwd but doesn't clean up its env.
We shouldn't make importing common_utils fail, so issue a warning and proceed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99659
Approved by: https://github.com/ezyang, https://github.com/malfet
As functional collectives are being updated, using tensor_split() as the underlying sharding algorithm would require padding and unpadding on multiple ranks. Therefore, we are changing the sharding algorithm to be in line with ``torch.chunk()``, so that padding is only needed on the last two ranks in most scenarios (see the illustration below).
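A small illustration of the difference between the two splitting schemes (10 elements over 4 ranks):
```python
import torch

t = torch.arange(10)
# tensor_split balances sizes, so several ranks may need padding: [3, 3, 2, 2]
print([c.numel() for c in torch.tensor_split(t, 4)])
# chunk uses a fixed chunk size, leaving only the trailing rank(s) short: [3, 3, 3, 1]
print([c.numel() for c in torch.chunk(t, 4)])
```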
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98722
Approved by: https://github.com/wanchaol
Add an entry for privateuse1 storage serialization register_package in _register_device_module.
1. Users only need to implement `privateuse1_tag` and `privateuse1_deserialize` in the device module of the open device (sketched below). When the device module is registered, these methods are registered with _package_registry in storage serialization.
2. Provides a fixed sequence number, 30, for privateuse1 in the storage serialization _package_registry list.
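A schematic sketch of the device-module side for a backend named "foo"; the hook signatures below mirror torch.serialization's (tagger, deserializer) pair and are assumptions based on the description above, not the exact API:
```python
import types
import torch

def privateuse1_tag(obj):
    # return a location string such as "foo:0" for storages on the device
    if obj.device.type == "foo":
        return str(obj.device)
    return None

def privateuse1_deserialize(obj, location):
    if location.startswith("foo"):
        # a real implementation moves the storage onto `location` here via the
        # backend's own copy path; returning it unchanged keeps the sketch short
        return obj
    return None

foo_module = types.SimpleNamespace(
    privateuse1_tag=privateuse1_tag,
    privateuse1_deserialize=privateuse1_deserialize,
)
# torch._register_device_module("foo", foo_module)  # hooks the methods up at priority 30
```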
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98920
Approved by: https://github.com/ezyang
Expand the sdpa_utils.h check to disable FlashAttention when using autograd, and mem-efficient attention, for the following cases:
- head_dim > 64
- sm86 or newer
Previously we only disabled these kernels on sm86 and for head_dim equal to 128.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99105
Approved by: https://github.com/malfet
**Summary:** This commit adds the `prepare_qat_pt2e` API and the
fusion logic for Conv + BN. We use the subgraph rewriter to
match and replace the pattern with the existing logic in
`nniqat.ConvBn2d`. Note this is not the end-to-end flow yet.
In particular, the convert flow needs to swap the new subgraph
with another one that merges the batchnorm stats back into conv.
The Conv + BN fusion is implemented in the following steps:
1. Annotate all nodes in the pattern `[conv - bn - getitem]`
2. Match and replace this pattern with the fused QAT pattern
(note that this is a larger subgraph than the original one)
3. Copy over metadata from the original nodes to the
corresponding nodes in the new subgraph, to ensure the
stack traces and dtype annotations are preserved
4. Prepare will insert fake quantizes in the right places
based on the annotations
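A minimal sketch of step 2 above using the FX subgraph rewriter; the real QAT pattern and replacement graphs are larger (they include the getitem node and the fake-quantize of the BN-scaled conv weight), and the metadata copying happens separately:
```python
import torch
import torch.nn.functional as F
from torch.fx import subgraph_rewriter

def pattern(x, conv_w, bn_w, bn_b, bn_rm, bn_rv):
    x = F.conv2d(x, conv_w)
    return F.batch_norm(x, bn_rm, bn_rv, bn_w, bn_b, training=True)

def replacement(x, conv_w, bn_w, bn_b, bn_rm, bn_rv):
    # stand-in for the fused QAT subgraph: fold the BN scale into the conv
    # weight (this is where a fake-quantize would be inserted) and re-run BN
    scale = bn_w / torch.sqrt(bn_rv + 1e-5)
    x = F.conv2d(x, conv_w * scale.reshape(-1, 1, 1, 1))
    return F.batch_norm(x, bn_rm, bn_rv, bn_w, bn_b, training=True)

# matches = subgraph_rewriter.replace_pattern(traced_module, pattern, replacement)
```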
**Test Plan:**
python test/test_quantization.py TestQuantizePT2E.test_qat_conv_bn_fusion
**Reviewers:** jerryzh168, kimishpatel, yanboliang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98568
Approved by: https://github.com/kimishpatel
Fixes #99221, clarifying the error message to highlight the index from the inputs which is responsible for the out-of-bounds error, while maintaining the reference to the relevant index of offsets as a secondary consideration.
Also takes care of some spelling/grammatical issues with another message (primarily "yout" changed to "your").
Would be happy to iterate on this if there are any issues.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99471
Approved by: https://github.com/albanD
**Background**: Prior to this PR, traces for PT2 w/ inductor don't contain connections between CUDA kernels and the CPU launch site. This PR adds those connections.
**Details**: Triton kernels launched by inductor use cuLaunchKernel instead of cudaLaunchKernel. cuLaunchKernel is part of the driver API, while cudaLaunchKernel is part of the runtime API. In order to support cuLaunchKernel, we added support in kineto (pytorch/kineto#752) to also start listening to driver events; hence why we need to update the kineto submodule.
After the change in kineto, we just need to turn this on in the PyTorch repo by adding the CUDA_DRIVER activity type to the CPU and CUDA activity type lists.
**Testing**: Added test/inductor/test_profiler.py to check for `cuLaunchKernel` in json trace files.
Also, I ran this test:
```python
import torch
x = torch.rand((2, 2), device='cuda')
def fn(x):
return x.relu()
fn_c = torch.compile(fn)
fn_c(x)
with torch.profiler.profile(with_stack=True) as prof:
fn_c(x)
prof.export_chrome_trace("relu_profile.json")
```
which generated this chrometrace:
<img width="930" alt="Screenshot 2023-04-18 at 2 58 25 PM" src="https://user-images.githubusercontent.com/5067123/232966895-b65f9daf-7645-44f8-9e2b-f8c11c86ef0a.png">
in which you can see flows between a `cuLaunchKernel` on the CPU side, and the triton kernel on the GPU.
**Kineto Updates**: To get the kineto-side changes required for cupti driver events, this PR updates the kineto pin. In that updated kineto submodule, we also have:
* JSON string sanitizing for event names (likely fix for #99572)
* cuda initialization fixes for multiprocessing
* cuKernelLaunch events (i.e. for this PR)
* DISABLE_CUPTI_LAZY_REINIT (from @aaronenyeshi)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99571
Approved by: https://github.com/ngimel, https://github.com/aaronenyeshi
It's part of the effort to improve the PT2 Export UX. This PR improves the usability of `torch.cond()` by separating user errors from dynamo internal errors. By definition, a user error means the usage of `torch.cond()` violates the restrictions of this API, and therefore the user needs to take action and fix the error.
In this notebook N3363227 we discovered a bunch of limitations of using `torch.cond(pred, true_fn, false_fn, operands)`. In summary, the limitations can be categorized as:
- predicate restriction (`pred`)
- operands restriction (`operands`)
- branch restriction (`true_fn` & `false_fn`)
The error message will be more accurate about where the (user) error is from and more actionable for users to fix it.
For example, `operands` must be a list of tensors and the signature of `true_fn` and `false_fn` must match with the `operands`.
If the operands contains non-tensor types, user will see error message like:
```
torch._dynamo.exc.UserError: Expected a list of tensors but got ["<class 'torch.Tensor'>", "<class 'float'>"]
from user code:
File "~/pytorch/test/dynamo/test_export.py", line 2504, in f_non_tensor_operands
return cond(True, lambda x, a: x.sin(), lambda x, a: x.cos(), [x, a])
```
If the signature of the branch function doesn't match with `operands`, user will see error message like:
```
torch._dynamo.exc.UserError: too many positional arguments.
func = 'false_fn' ~/pytorch/test/dynamo/test_export.py:2514, args = [<class 'torch.Tensor'>, <class 'torch.Tensor'>], kwargs = {}
```
Or if the tensor returned from user defined branches has different metadata, e.g. shapes, dtypes, etc., user will see error message like:
```
TypeError: Expected each tensor to have same metadata but got:
cond_true_0 returns TensorMetadata(shape=torch.Size([2, 1]), dtype=torch.int64, requires_grad=False, stride=(1, 1), memory_format=torch.contiguous_format, is_quantized=False, qparams={})
cond_false_0 returns TensorMetadata(shape=torch.Size([1]), dtype=torch.float32, requires_grad=False, stride=(1,), memory_format=torch.contiguous_format, is_quantized=False, qparams={})
```
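For contrast, a minimal call that satisfies the restrictions above (the predicate is a boolean tensor, the operands are a list of tensors, and both branches share the same signature and output metadata); the import path below is the one in use at the time and may have since moved:
```python
import torch
from functorch.experimental.control_flow import cond  # location at the time

def true_fn(x):
    return x.sin()

def false_fn(x):
    return x.cos()

x = torch.randn(4)
out = cond(x.sum() > 0, true_fn, false_fn, [x])
```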
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98909
Approved by: https://github.com/jansel
Summary:
A very old refactor (https://github.com/pytorch/pytorch/pull/29500) split ScriptModule into ScriptObject (base class) and ScriptModule (subclass). When moving methods around, the `_type` method was moved from ScriptModule to ScriptObject, but the type of its argument wasn't changed. Therefore, it is now impossible to invoke `_type` on a ScriptObject.
The reason I need this fix is that I am using PyTorch's dispatch mode to intercept some operators that accept/return custom classes, which end up being encoded as ScriptObject, and in order to properly handle them I need to be able to verify their type.
Test Plan: N/A
Differential Revision: D45118675
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99542
Approved by: https://github.com/albanD
Why?
* To reduce the latency of hot path in https://github.com/pytorch/pytorch/pull/97377
Concern - I had to add `set_offset` in all instances of `GeneratorImpl`. I don't know if there is a better way.
```
import torch
torch.cuda.manual_seed(123)
print(torch.cuda.get_rng_state())
torch.cuda.set_rng_state_offset(40)
print(torch.cuda.get_rng_state())
tensor([123, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0], dtype=torch.uint8)
tensor([123, 0, 0, 0, 0, 0, 0, 0, 40, 0, 0, 0, 0, 0,
0, 0], dtype=torch.uint8)
```
Reland of https://github.com/pytorch/pytorch/pull/98965
(cherry picked from commit 8214fe07e8a200e0fe9ca4264bb6fca985c4911e)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99565
Approved by: https://github.com/anijain2305
Follow-up on Jason's idea of tensor layout tuning. Add a script to show the perf impact of layout on convolution (will add more cases like batch/layer norm and reduction to the script).
For convolution, a quick test shows that using the channels-last layout we get a 1.4x speedup:
```
baseline 4.509183883666992 test 3.178528070449829 speedup 1.419x
```
The speedup definitely also depends on input/weight shapes. E.g., changing the input channels in the test from 3 to 8, we see the speedup increase to 2.1x.
The trace shows that cuDNN calls different kernels when the input layout changes to channels-last.
<img width="997" alt="Screenshot 2023-04-19 at 5 27 54 PM" src="https://user-images.githubusercontent.com/52589240/233228656-4bdcac0a-7633-416a-82e1-17d8dc8ea9a6.png">
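A hedged sketch of the kind of comparison the script makes (the shapes, dtype, and timing helper below are illustrative, not the script itself):
```python
import torch

conv = torch.nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3).cuda().half()
x = torch.randn(32, 3, 224, 224, device="cuda", dtype=torch.half)

def bench(fn, iters=100):
    for _ in range(10):  # warm up
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # ms per iteration

baseline = bench(lambda: conv(x))
conv_cl = conv.to(memory_format=torch.channels_last)
x_cl = x.to(memory_format=torch.channels_last)
test = bench(lambda: conv_cl(x_cl))
print(f"baseline {baseline:.3f} test {test:.3f} speedup {baseline / test:.3f}x")
```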
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99583
Approved by: https://github.com/jansel
Fixes #ISSUE_NUMBER
1. torch.jit.load for custom device
```
# custom device named `foo`
ts_model = torch.jit.script(model.to(device="foo"))
ts_model.save("./ts.pt") # it is a script model on device `foo`
# and then we want to load it and run it
torch.jit.load("./ts.pt")
```
2. Add some extra keys for custom devices with `privateuse1`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99535
Approved by: https://github.com/albanD
Coordinating with @arogozhnikov from einops team, allowing specific operators in the dynamo graph avoids dynamo tracing problems provided the operators are screened for safety - they must not bake in unintended constants or take data-dependent control flow paths.
Fixes #99031
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99478
Approved by: https://github.com/jansel
This is a draft version of the generic context manager; I believe there are some scenarios that I didn't anticipate. I posted this draft for discussion, to check whether this is the right direction.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98725
Approved by: https://github.com/jansel
Summary:
In order to keep the quantizer simple, we want to move the annotation code for operators like flatten, hardtanh, etc. to a separate utility function that is called after the quantizer annotation is done. This makes these ops (the operator list) not configurable by the user, and also makes prepare_pt2e operator-aware instead of operator-agnostic. This design is not final; we may change it in the future if we find use cases that need these to be configurable, or if we feel it is important for prepare_pt2e to stay agnostic to operators/operator patterns.
Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_qnnpack_quantizer_obs_sharing_ops
Differential Revision: [D45071006](https://our.internmc.facebook.com/intern/diff/D45071006)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99384
Approved by: https://github.com/kimishpatel
Removes two additional places where we would construct tensors
- Non-static inputs. These are only constructed to invoke the `copy_` kernel and do not own memory so we can construct them only once
- Aliases of persistent static inputs (parameters): the memory will be permanently live and is not managed by the cudagraph tapes.
(also sneaking in a bug fix around unaligned static idx)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98950
Approved by: https://github.com/jansel
The design of export API expects constraints to be specified on dynamic dimensions, while assuming all other dimensions are static by default. However a user who wishes to export a model may not be fully familiar with the code to plan what to specify.
This diff provides support for discovering constraints to specify. The basic idea is to take the set of generated shape guards and convert them into appropriate constraints. However, we usually generate a LOT of shape guards, and there is often a LOT of redundancy in them. Thus, we also need to simplify the guards so that our suggested constraints are concise yet capture the information content in the guards.
The algorithm for simplification uses `sympy` under the hood, but very surgically to avoid any risk of blowing up. See comments inline for a full description. Briefly,
1. We consider only univariate inequalities, and among them, solve for equalities first.
2. We substitute these exact solutions to convert multivariate inequalities progressively into univariate.
3. Remaining univariate inequalities are solved using `sympy.solvers.inequalities.reduce_inequalities`.
4. As pre-processing, we also eliminate all `//` and `%` operations to generate a set of linear congruence guards, and solve these using `sympy.ntheory.modular.solve_congruence`.
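A toy illustration of the two sympy facilities mentioned in steps 3 and 4 (the inputs below are made up, not real shape guards):
```python
import sympy
from sympy.solvers.inequalities import reduce_inequalities
from sympy.ntheory.modular import solve_congruence

s0 = sympy.Symbol("s0", positive=True)

# step 3: univariate inequalities like those left after substitution
print(reduce_inequalities([s0 >= 2, 3 * s0 < 300], s0))  # (2 <= s0) & (s0 < 100)

# step 4: linear congruences from eliminating // and %, e.g.
# s0 % 4 == 0 and s0 % 6 == 0  =>  s0 == 0 (mod 12)
print(solve_congruence((0, 4), (0, 6)))  # (0, 12)
```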
The results are quite dramatic. For example, an internal model produced several hundreds of guards with `dynamic_shapes=True`, which were pretty much inscrutable for humans. The summary contains around 30 dimensions that were specialized and 3 constraints on dynamic dimensions. The output format looks like this:
```
The following dimensions have been specialized and CANNOT be dynamic.
NOTE: Specializations will happen by default with `assume_static_by_default=True`.
L['foo']['bar'].size()[0] == 4
...
L['baz']['qux'].size()[3] == 96
The following dimensions CAN be dynamic.
You can use the following code to specify the constraints they must satisfy:
constraints=[
dynamic_dim(L['blah']['bleh'], 1) == dynamic_dim(L['blah']['bloh'], 1),
...,
2 <= dynamic_dim(L['blah']['bloh'], 1),
]
```
Differential Revision: [D44731747](https://our.internmc.facebook.com/intern/diff/D44731747/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98463
Approved by: https://github.com/voznesenskym, https://github.com/ezyang
Summary:
There are two variables for profiler input shapes:
- In C++ interface: report_input_shapes
- In Python interface: record_shapes
Therefore record_input_shapes is a typo. We should also look into reducing the redundant naming between the two.
Test Plan: CI
Pulled By: aaronenyeshi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99430
Approved by: https://github.com/davidberard98
This PR also adds a way to CSE statements (not only assignments).
The tests follow the pattern from https://github.com/openai/triton/pull/1143
They take a fair amount of time to run (90s on my box). If we wanted to improve this, we could avoid testing the `ndim == 3` case.
Changes like this one make me hope that we get to clean up the number of lowerings we have at some point...
Generated code for `x[y]` with `x.shape == (3, 2, 4), y.ndim == 1`:
With `dynamic=False`:
```python
tmp0 = tl.load(in_ptr0 + (x1), xmask)
tl.device_assert(((0 <= tmp0) & (tmp0 < 3)) | (~xmask), f"index out of bounds: 0 <= tmp0 < 3")
tmp1 = tl.load(in_ptr1 + (x0 + (8*tmp0)), xmask)
```
With `dynamic=True`:
```python
tmp0 = tl.load(in_ptr0 + (x1), xmask)
tl.device_assert(((0 <= tmp0) & (tmp0 < ks3)) | (~xmask), f"index out of bounds: 0 <= tmp0 < ks3")
tmp1 = tl.load(in_ptr1 + (x0 + (ks1*ks2*tmp0)), xmask)
```
Generated code for `x[y+1, y+1]` with `x.shape == (3, 2, 4)`, `y.shape == (3, 3)`:
With `dynamic=False` (note how it folds the two upper bounds to `min(3, 2) == 2`):
```python
tmp0 = tl.load(in_ptr0 + (x1), xmask)
tmp1 = 1
tmp2 = tmp0 + tmp1
tl.device_assert(((0 <= tmp2) & (tmp2 < 2)) | (~xmask), f"index out of bounds: 0 <= tmp2 < 2")
tmp3 = tl.load(in_ptr1 + (x0 + (12*tmp2)), xmask)
```
With `dynamic=True`:
```python
tl.device_assert(((0 <= tmp2) & (tmp2 < min(ks2, k1))) | (~xmask), f"index out of bounds: 0 <= tmp2 < min(ks2, ks1)")
```
The same works when the CSE'd variable appears 3 or more times, but then it generates `min(ks0, min(ks1, ks2))`
Generated code for `x[y] = z` with `x.ndim = 3`, `y.ndim = 1` and dynamic shapes
```python
tmp0 = tl.load(in_ptr0 + (x1), xmask)
tmp1 = tl.load(in_ptr1 + (x2), xmask)
tl.device_assert(((0 <= tmp0) & (tmp0 < ks3)) | (~xmask), f"index out of bounds: 0 <= tmp0 < ks3")
tl.store(out_ptr0 + (x0 + (ks1*ks2*tmp0) + tl.zeros([XBLOCK], tl.int32)), tmp1, xmask)
```
Fixes https://github.com/pytorch/pytorch/issues/93538
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98590
Approved by: https://github.com/ngimel
### Description
The PR aims at reducing CPU overhead of context manager style coalescing.
By "context manager style coalescing", we mean:
Sync style:
```
with _coalescing_manager():
for i in range(num_coll):
dist.all_reduce(tensors[i])
```
Async style:
```
with _coalescing_manager(async_ops=True) as cm:
for i in range(num_coll):
dist.all_reduce(tensors[i])
cm.wait()
```
In the previous implementation, each collective in the `num_coll` loop actually called into the C++ backend, accumulating pybind overhead.
In the new implementation, we capture the collectives at Python level, and only fire towards C++ at the exit of the coalescing manager.
### Tests
In current PR, the "fast path" only applies to all-reduce.
- Flattened 512M: 16.38 ms, including CPU time 131.21 us
- Old _coalescing_manager 64 x 8M: 22.19 ms, including CPU time 2865 us
- New _coalescing_manager 64 x 8M: 16.93 ms, including CPU time 635 us
Hence a 4x reduction in CPU overhead (dependent on `num_coll`).
Cc @mrshenli @kumpera @wanchaol @fegin
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98793
Approved by: https://github.com/kumpera
Sharing code between the code that handles test results in parallel vs serial mode.
Note that the original version of this code had an inconsistency between the two versions where it would execute `print_to_stderr(err_message)` on every test that ran in parallel, but for serial tests it would only invoke `print_to_stderr(err_message)` if `continue_on_error` was also specified. By sharing code, this PR changes that behavior to be consistent between the two modes.
Also adding some comments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99467
Approved by: https://github.com/huydhn, https://github.com/malfet
For cases where the pattern graph matches on x arguments but the matching graph omits some of these arguments (by using the default values instead), SubgraphMatcher currently fails because the graphs have a different number of arguments. So instead, in the case where the pattern/replacement nodes have a different number of arguments, we will add the default values onto whichever argument list is missing them.
Note this support is only for when we are matching targets that are instances of OpOverload, which have a schema and default values tied to them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99431
Approved by: https://github.com/jerryzh168
Given the following case:
```
import torch
a= torch.empty_strided([64, 1, 33], [33, 3, 1], dtype=torch.bfloat16).fill_(1)
b = torch.randn(64, 33, 256).to(dtype = torch.bfloat16)
y = torch.ops.aten.bmm(a, b)
```
```a``` is a contiguous tensor, but its strides are not the default contiguous strides ([33, 33, 1]), so the onednn matmul always runs a non-optimized path:
```
onednn_verbose,exec,cpu,matmul,gemm:jit,undef,src_bf16::blocked:abc:f0 wei_bf16::blocked:abc:f0 dst_bf16::blocked:abc:f0,attr-scratchpad:user ,,64x1x33:64x33x256:64x1x256,7.28711
```
This PR converts the inputs' strides to the default contiguous strides before calling onednn, so that an optimized path is run:
```
onednn_verbose,exec,cpu,matmul,brg:avx512_core_amx_bf16,undef,src_bf16::blocked:abc:f0 wei_bf16::blocked:abc:f0 dst_bf16::blocked:abc:f0,attr-scratchpad:user ,,64x1x33:64x33x256:64x1x256,3.06396
```
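An illustration of the situation above: the tensor is contiguous (the stride of the size-1 dimension is irrelevant), yet its strides differ from the default ones; one hedged way to view the data with default strides is shown at the end (the actual fix normalizes strides inside the aten bmm path):
```python
import torch

a = torch.empty_strided([64, 1, 33], [33, 3, 1], dtype=torch.bfloat16).fill_(1)
print(a.is_contiguous())                # True
print(a.stride())                       # (33, 3, 1)
print(torch.empty(64, 1, 33).stride())  # (33, 33, 1), the default strides

# view the same data with default contiguous strides
a_default = a.as_strided(a.size(), (33, 33, 1))
```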
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99511
Approved by: https://github.com/mingfeima, https://github.com/jgong5
Fixes #99427
Given the provided CI logs, I ~~suspect~~[^1] `inf` is being hit with the initial (FSDP model) step of the [test in question](https://github.com/pytorch/pytorch/actions/runs/4707887920/jobs/8350225813#step:13:36189). The DDP loss is correct and indicative of two steps being taken but the FSDP loss is approximately half of the loss expected with the first step (suggesting a step was skipped and the scale was halved). I'm further reducing `init_scale` in this PR in order to ~~test the hypothesis~~[^2] (error occurs with 4 device multi-gpu tests only, not the 2 device tests I can verify locally).
I'll ensure I add the label `ciflow/periodic`[^3] to future PRs I suspect could potentially exhibit divergent behavior with >2 devices. Ideally all tests would be insensitive to device scaling but I recognize for some tests imposing that design constraint might be more trouble than it's worth.
@awgu @huydhn
[^1]: Suspicion confirmed
[^2]: The relevant periodic tests are [now passing](https://github.com/pytorch/pytorch/actions/runs/4738073998/jobs/8411862508)
[^3]: Didn't know that existed, great to know!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99485
Approved by: https://github.com/huydhn
All Sources must be hashable, since we are using set equality to check for
duplicate sources in AOTAutograd. We should have a more systematic way
of asserting this. For this PR just fix the local issue.
Fixes #99145
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99379
Approved by: https://github.com/ezyang
Months ago, in order to get dynamic shapes working through to Dynamo backends, we changed the calling convention to pass fake tensors rather than real tensors as example inputs to backends. The motivation at the time was, well, backends shouldn't really be peeking at the real tensors when they are doing compilation, and so it would make more sense to hide the real tensors from backends. But there were a bunch of problems:
* This interacted poorly with our accuracy minifier design: accuracy minifier needs access to the real inputs in order to run the model and figure out what happens!
* The TensorRT backend required real inputs and we never figured out how to fix it.
* In practice, all the backends needed to detect if they were passed real tensors, and fakeify them anyway (certainly AOTAutograd does this)
* Parameters and inputs are treated non-uniformly: parameters had to be passed as real tensors, because CUDA graphs requires knowing what the actual tensors are
Furthermore, there were some more problems discovered after the fact:
* Backends may want to optimize on aspects of tensors which you cannot tell without having real tensors; e.g., alignment of the data pointer
So, this PR decides that changing the calling convention was a bad idea, and switches back to passing real tensors. There is a problem though: AOTAutograd will perform fakeification, which means that in practice backends are still going to end up with fake tensors in the end anyway. I want to change this, but this will require some work with bdhirsh's upcoming AOTAutograd export refactor.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99320
Approved by: https://github.com/voznesenskym
Command to run max autotune baseline:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --only ${MODEL_NAME} --training --batch-size-file $(realpath benchmarks/dynamo/torchbench_models_list.txt)
```
Command to do coordinate descent autotuning:
```
TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_shunting_coordesc TORCHINDUCTOR_PERSISTENT_REDUCTIONS=0 TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --only ${MODEL_NAME} --training --batch-size-file $(realpath benchmarks/dynamo/torchbench_models_list.txt)
```
Explanation of the envvars shown in the command:
```
- TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 : enable coordinate descent tuning
- TORCHINDUCTOR_PERSISTENT_REDUCTIONS=0 : disable persistent reduction. Need do this so we can tune RBLOCK for reductions
- TORCHINDUCTOR_MAX_AUTOTUNE=1: enable max autotune
- TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_shunting_coordesc : use a separate cache dir for coordinate descent tuning. Optional.
```
Here are my experiments results for around 40 torchbench models: https://docs.google.com/spreadsheets/d/1G7i2whIf8Yu-HhN_WovNxwcE-iFDSAw4x3NK4uL4XhI/edit#gid=0
Some highlights
- We improve 2.2% further upon max-autotune on average (geomean)
- timm_resnest benefits most from coordinate descent tuning. There is 1.07x speedup
- We have a decent speedup on transformer models
- BERT_pytorch: 1.056x
- timm_vision_transformer: 1.04x
- hf_Bert: 1.030x
- For resnet models, it looks like we have less gain as the model gets larger. My guess is that larger models spend more time on mm/conv, so our tuning for pointwise/reduction helps less
- resnet18: 1.021x
- resnet50: 1.014x
- resnet152: 1.005x
This kind of coordinate descent autotuning can give us an 'upper bound' on the gain we can get from tuning configs for pointwise/reduction kernels. On the other hand, by spot checking, we roughly double the compilation time compared to max-autotune. Next steps can be:
- we disable persistent reductions in coordinate descent autotuning (they are still enabled in the baseline) so we can tune RBLOCK for reductions. We can also try to use autotuning to pick whether to use persistent reductions or not.
- pick good config without benchmarking (e.g. Natalia mentioned checking register spill)
- try the idea on matmul so we know what's the potential there.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97203
Approved by: https://github.com/ngimel
As CUDA 11.7 is getting deprecated anyway.
Also, fix the problem where the script actually generated the same workflow twice, overriding the 11.8 ones with 11.7+11.7-with-pypi.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99458
Approved by: https://github.com/dagitses, https://github.com/atalman
test_proxy_tensor fails when run by itself (python test/test_proxy_tensor.py -v),
but not when all of the tests are run together.
The cause is that torch._dynamo isn't imported in
torch/fx/experimental/proxy_tensor.py but it is using functions from the
torch._dynamo package.
The fix in this PR is to add the import statements. In the future we can
consider always importing torch._dynamo on `import torch` or moving the
import to the top of the file, but there are some serious circular
dependencies to be worked out.
NB: an import in the middle of the file makes the function a bit slow
the first time the import happens but all subsequent calls are fast.
Test Plan:
- python test/test_proxy_tensor.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99415
Approved by: https://github.com/soulitzer
We've renamed the `master` branch to `main`. Lintrunner should check for a merge base from this new branch now
Updated the linter configuration to reflect the new default branch name. Changed `merge_base_with` from `origin/master` to `origin/main` in `.lintrunner.toml`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99466
Approved by: https://github.com/kit1980, https://github.com/malfet
Summary:
A fix to ensure that kernels generated for `torch._int_mm` can be cached. We can remove this hack once eager mode `torch._int_mm` is better supported.
Let me know if something more proper is needed instead of the hack.
Test plan:
```
# running the script below led to two compilations of the triton
# int8,int8->int32 kernel before this PR, and only has
# one compilation which is reused after this PR
import torch
import torch.nn as nn
x = torch.randint(-128, 127, (32, 32), device='cuda', dtype=torch.int8)
y = torch.randint(-128, 127, (32, 32), device='cuda', dtype=torch.int8)
class M(nn.Module):
def forward(self, x):
x = torch._int_mm(x, y)
x = x.to(torch.int8)
x = torch._int_mm(x, y)
return x
m = M().cuda().half()
m = torch.compile(m, options={"max-autotune": True})
z = m(x)
z = m(x)
```
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99283
Approved by: https://github.com/nmacchioni, https://github.com/janeyx99
Caches output tensors for the common case when the output Tensor storage is unaliased for all graph outputs in all paths. For these persisted tensors we adjust the liveness tracking by also checking that the output tensor does not have an additional python reference.
I limit cached output tensors to be unaliased. If a descendant node discovers it has an alias of a prior output, then the aliased output will no longer be persisted in the ancestor.
The large majority of tensors are unaliased, and preserving aliased output tensors would add significant additional complexity with marginal gains. For instance, when do checkpointing and re-recordings, we need to remove the persisted tensors otherwise it would prevent memory from being reclaimed. If a single persisted tensor was present in multiple paths then that would create an inter-path dependence which adds complexity. Additionally, each further caching of the output would affect the reference count of the other caches, and that reference count would also need to be adjusted depending on if a node was checkpointed.
Still need to do a complete run, but for the models I tried it makes the performance extremely close between the trees and non-trees implementations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98944
Approved by: https://github.com/jansel, https://github.com/ngimel
Summary: Removes the dependency on the unified YAML file
Test Plan:
Smoke test via some caffe2 tests.
```
buck2 run xplat/caffe2:supported_mobile_models_test
```
Build a major FoA app that uses model tracing and confirm it still works.
```
buck2 build fb4a
```
CI/CD for the rest. If operator tracing / bundling was broken, I'd hope in the 1000+ tests spawned by this change should catch it.
Differential Revision: D44946368
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99122
Approved by: https://github.com/dhruvbird
This bug was discovered by a stronger assert (which I will be posting
in a follow up PR.)
The explanation for this change is a bit long and windy, and I am not
sure I entirely understand the situation myself. But here's what I
think is going on.
jansel's joint graph pattern matcher does something fairly unusual:
in order to initialize the pattern in question, it (lazily) runs
an aot_function invocation in order to trace out what the joint graph
of a given pattern looks like (we ought not use aot_function, but we
can't really do this until bdhirsh lands AOT Autograd export properly).
However, this lazy initialization occurs within the context of a
separate compilation, which has its own tracing context, and
importantly, fake tensor mode.
What we would like, is the pattern matcher lazy initialization fake
tensor mode to be unrelated to whatever the ambient fake tensor mode of
the graph we were actually compiling. We want these to be independent,
because we don't really care what the current compiled graph is; this is
a lazy init function, it could have gotten initialized during any
compilation, it just happens to be initialized on this one.
To prevent us from picking up the ambient fake mode, we have to do two
things: we have to remove the tracing context (which stores a fake
mode), and we have to also disable the ambiently active fake mode.
In https://github.com/pytorch/pytorch/pull/99377 eellison proposed an
alternative approach, where we reuse the fake mode. While this probably
won't cause any errors, it's morally not the right thing to do, because
you'll end up polluting the enclosing fake tensor mode with tensors that
have nothing to do with the mode itself.
This might fix https://github.com/pytorch/pytorch/issues/99286
but it's also possible that https://github.com/pytorch/pytorch/pull/99320
fixed it already.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99391
Approved by: https://github.com/bdhirsh
This PR fixes divergent value issues in converting float32 to uint8. The failures of `TestTensorCreationCPU.test_float_to_int_conversion_finite_cpu_uint8` came from the divergent values between PyTorch and numpy across platforms. This PR adds two items:
- Enhance `_float_to_int_conversion_helper()` to take given reference values, providing a stable reference.
- Omit the test for `float.max`, since the results on PyTorch are divergent (e.g. `float.max` -> `uint8` is 0 on x86_64 but 255 on s390x).
Fixes #97794
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98916
Approved by: https://github.com/dagitses
I got too confused by the FakeTensor printing, so this PR fixes it to
print normally.
Before:
```
with FakeTensorMode():
x = torch.empty(2, 2, device="cpu")
print(x)
# FakeTensor(FakeTensor(..., device='meta', shape=(2, 2)), cpu)
```
After (Tensor printing doesn't print the default device):
```
FakeTensor(..., shape=(2, 2))
```
Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99205
Approved by: https://github.com/eellison
**Summary**
After https://github.com/pytorch/pytorch/pull/99064 and https://github.com/pytorch/pytorch/pull/99065 were merged, the pt2e UT path changed, so we also need to change the module path in `test/test_quantization.py`. Then we can run these tests from the top-level test directory.
**Test Plan**
```
cd test && python -u -m pytest test_quantization.py -k TestQuantizePT2E
cd test && python -u -m pytest test_quantization.py -k TestQuantizePT2EModels
cd test && python -u -m pytest test_quantization.py -k TestQuantizePT2EFX
cd test && python -u -m pytest test_quantization.py -k TestQuantizePT2EFXX86Inductor
cd test && python -u -m pytest test_quantization.py -k TestQuantizePT2EFXModels
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99402
Approved by: https://github.com/jerryzh168
Summary of changes:
- Add CPython exceptiontable parsing/assembling functions in torch/_dynamo/bytecode_transformation.py, based on https://github.com/python/cpython/blob/3.11/Objects/exception_handling_notes.txt.
- Add optional `exn_tab_entry` field to dynamo `Instruction`s in torch/_dynamo/bytecode_transformation.py in order to virtualize exception table entries (start, end, target instructions).
- Add checks guarding against duplicate instructions in dynamo, so that jump/exceptiontable targets are unambiguous. See `get_indexof` in torch/_dynamo/bytecode_analysis.py. Ensure that bytecode generation throughout dynamo does not generate duplicate instructions.
- Allow dynamo bytecode generation logic to generate nested exception table entries for developer convenience. CPython expects entries to not overlap, so we flatten nested entries during assembly in torch/_dynamo/bytecode_transformation.py:compute_exception_table.
- Simulate the block stack in torch/_dynamo/symbolic_convert.py. CPython removed the block stack in 3.11, but dynamo needs it in order to keep track of active contexts. So we simulate the block stack as before by looking at exceptiontable entries in order to determine the current blocks.
- Update context codegen in torch/_dynamo/resume_execution.py. The `SETUP_FINALLY` bytecode, which conveniently had a jump target to the finally block, was removed in 3.11, so we need to keep track of the jump target of the finally block using exceptiontables. Generating resume functions is more difficult since the original exceptiontable entries pointing to old cleanup code need to be modified to point to new cleanup code.
- Fix a push_null bug in torch/_dynamo/variables/functions.py introduced by https://github.com/pytorch/pytorch/pull/98699
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96511
Approved by: https://github.com/jansel, https://github.com/yanboliang, https://github.com/albanD
Fixes the underlying issue previously addressed in #92201 by specifying minimum alignments explicitly to `cuBLAS` rather than relying on a handcrafted rule. ~~We're still investigating some potential failure modes on `sm80` and `sm90` but those would be real `cuBlasLt` heuristics bugs rather than being caused by underspecifying constraints to the heuristics.~~
According to the `cuBLAS` docs, the default alignment is 256 bytes, so that is the maximum currently being checked: https://docs.nvidia.com/cuda/cublas/
CC @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98975
Approved by: https://github.com/ngimel
Original Issue from #92670
pytest ./generated/test_XuyangBai_PointDSC.py -k test_004
==> RuntimeError: as_strided_scatter: sizes [4], strides [85], storage offset 256 and itemsize 4 requiring a storage size of 2048 are out of bounds for storage of size 1024
Repro:
```
import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()

    def forward(self, x):
        x[1].fill_diagonal_(0)  # this size check failed

device = torch.device("cpu")
model = Model()
model.to(device)
torch._dynamo.reset()
compiled_model = torch._dynamo.optimize("inductor")(model)
arg = [torch.rand([4, 1, 1])]
compiled_model(*arg)
```
The error was raised when checking the required size in as_strided_scatter.
https://github.com/pytorch/pytorch/blob/master/torch/_prims/__init__.py#L1818
When the input is a tensor with a storage offset (a view), computing the input's storage length should also take the base tensor's size/stride/offset into account, instead of comparing against the input's number of elements.
This diff fixes the bug and adds a test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98483
Approved by: https://github.com/ngimel
The Inductor CUDA unit tests don't preserve ```storage_offset``` when cloning inputs; this PR fixes that by making both the CUDA and CPU tests use the same helper function ```clone_preserve_strides```.
This was found by @lantiankaikai while working on #98483, but he couldn't test it due to lack of a CUDA environment.
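For reference, a minimal sketch of what a stride- and offset-preserving clone can look like (an illustrative re-implementation, not necessarily the exact helper in the inductor test suite; zero-sized tensors are ignored):
```python
import torch

def clone_preserve_strides(x: torch.Tensor) -> torch.Tensor:
    # How much of the underlying storage the view actually touches.
    needed_size = (
        x.storage_offset()
        + sum((size - 1) * stride for size, stride in zip(x.size(), x.stride()))
        + 1
    )
    # Clone that flat region, then rebuild the original view (including its
    # storage_offset) on top of the cloned buffer.
    buffer = torch.as_strided(x, (needed_size,), (1,), 0).clone()
    return torch.as_strided(buffer, x.size(), x.stride(), x.storage_offset())

base = torch.arange(16.0)
view = base.as_strided((2, 2), (4, 1), 3)              # a view with storage_offset=3
cloned = clone_preserve_strides(view)
print(cloned.storage_offset(), view.storage_offset())  # both 3
```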
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99118
Approved by: https://github.com/ngimel
## What problem this PR solves?
#97170 fixed `equal` operator return type (old: Tensor, now: bool) by giving it the correct sharding propagation. This is consistent with the `aten::equal` op. However, the correctness only stays at the local result level:
* `equal` op returns True if the local copy of dtensor A is equal to the local copy of dtensor B
This is not the correct semantics of `equal`, which should return True only if all local copies of A are equal to the corresponding local copies of B.
## What is this PR?
1. For non-participating ranks, if the return type is scalar, `local_results` is set to `None` which means the default value is a reduced result of participating ranks only.
2. For all ranks, if the return type is scalar and the `op_call` is `aten::equal`(because `aten::equal` is the only function that returns scalar value and needs communication), all gather the `local_results` within the `default pg` and reduce on them with `operator.and_`. The result will be the new `local_result`.
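For illustration, the reduction in item 2 can be sketched roughly as follows (this is not the actual DTensor dispatch code; it assumes the default process group is initialized and that `local_result` is each rank's local `aten::equal` result, with `None` on non-participating ranks):
```python
import functools
import operator
import torch.distributed as dist

def reduce_scalar_equal(local_result):
    # All-gather every rank's local result and AND them together.
    world_size = dist.get_world_size()
    gathered = [None] * world_size
    dist.all_gather_object(gathered, local_result)
    # Non-participating ranks contribute None; skip them in the reduction.
    participating = [r for r in gathered if r is not None]
    return functools.reduce(operator.and_, participating, True)
```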
## Result/Impact
For non-participating ranks, when the return type is scalar:
1. op is `aten::equal`: the return value is the same as on all other ranks
2. op is not `aten::equal`: the return value is None. Before this PR, this would raise "NotImplementedError", but that path had not been tested.
For participating ranks, when the return type is scalar:
1. op is `aten::equal`, the return value is the equality of two dtensor operands - True if all copies are equal, False otherwise.
2. op is not `aten::equal`, simply the local computation result.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99014
Approved by: https://github.com/wanchaol
TorchScript only supports indexing into ModuleLists with integer literals. The error message already warns about this; but this PR adds clarifications around what a "literal" is. I'm adding this PR because, in my opinion, it's not obvious what a "literal" is and how strict its definition is. The clarification provided in this PR should make it easier for users to understand the issue and how to fix it.
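A small illustrative example of the restriction being clarified (the module here is made up for demonstration):
```python
import torch
import torch.nn as nn

class Stack(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(4, 4) for _ in range(3)])

    def forward(self, x, i: int):
        x = self.layers[0](x)    # OK: the index is an integer literal
        # x = self.layers[i](x)  # not scriptable: the index is a runtime value,
        #                        # not a literal the compiler can resolve
        return x

scripted = torch.jit.script(Stack())  # compiles; the commented-out line would not
```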
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98606
Approved by: https://github.com/eellison, https://github.com/gmagogsfm
Months ago, in order to get dynamic shapes working through to Dynamo backends, we changed the calling convention to pass fake tensors rather than real tensors as example inputs to backends. The motivation at the time was, well, backends shouldn't really be peeking at the real tensors when they are doing compilation, and so it would make more sense to hide the real tensors from backends. But there were a bunch of problems:
* This interacted poorly with our accuracy minifier design: accuracy minifier needs access to the real inputs in order to run the model and figure out what happens!
* The TensorRT backend required real inputs and we never figured out how to fix it.
* In practice, all the backends needed to detect if they were passed real tensors, and fakeify them anyway (certainly AOTAutograd does this)
* Parameters and inputs are treated non-uniformly: parameters had to be passed as real tensors, because CUDA graphs requires knowing what the actual tensors are
Furthermore, there were some more problems discovered after the fact:
* Backends may want to optimize on aspects of tensors which you cannot tell without having real tensors; e.g., alignment of the data pointer
So, this PR decides that changing the calling convention was a bad idea, and switches back to passing real tensors. There is a problem though: AOTAutograd will perform fakeification, which means that in practice backends are still going to end up with fake tensors in the end anyway. I want to change this, but this will require some work with bdhirsh's upcoming AOTAutograd export refactor.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99320
Approved by: https://github.com/voznesenskym
Summary:
This diff is reverting D44897935
D44897935: [FSDP] Include duplicate parameters and modules when calling named_parameters and named_modules (#98912) by fegin has been identified to be causing the following test or build failures:
Tests affected:
- [caffe2/torch/fb/module_factory/sync_sgd/tests:test_pyper_data_parallel_wrapper - caffe2.torch.fb.module_factory.sync_sgd.tests.test_pyper_data_parallel_wrapper.PyPerDataParallelWrapperTest: test_fsdp_submodules_pyper](https://www.internalfb.com/intern/test/562950025957458/)
Here's the Multisect link:
https://www.internalfb.com/multisect/1893714
Here are the tasks that are relevant to this breakage:
We're generating a revert to back out the changes in this diff, please note the backout may land if someone accepts it.
If you believe this diff has been generated in error you may Commandeer and Abandon it.
Test Plan: NA
Reviewed By: fegin
Differential Revision: D45027286
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99353
Approved by: https://github.com/izaitsevfb, https://github.com/fegin
Removes a check which would sometimes allow `off_by_default` artifacts to be logged if logged at a higher level.
This change will only allow artifact messages to be displayed if the artifact is enabled, regardless of level.
closes #99144
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99260
Approved by: https://github.com/lezcano
Currently, storage only considers a subset of backends. We want storage to be creatable on a custom backend via the PrivateUse1 key.
This also provides easy automatic generation of storage-related attributes:
when the user registers a new backend, the corresponding methods and attributes can be generated automatically.
Run the following code:
`torch.utils.rename_privateuse1_backend('foo')`
`torch.utils.generate_storage_for_privateuse1_backend()`
Then, get the following methods and attributes.
`torch.TypedStorage.is_foo`
`torch.TypedStorage.foo()`
`torch.UntypedStorage.is_foo`
`torch.UntypedStorage.foo()`
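Combined into one snippet, following this PR's description ("foo" is just the example backend name; a real PrivateUse1 C++ backend must be loaded before the generated methods do anything useful, and the helper names below are the ones listed above):
```python
import torch

torch.utils.rename_privateuse1_backend("foo")
torch.utils.generate_storage_for_privateuse1_backend()

# The generated attributes/methods now exist on the storage classes:
print(hasattr(torch.UntypedStorage, "is_foo"))  # expected: True
print(hasattr(torch.TypedStorage, "foo"))       # expected: True
```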
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98478
Approved by: https://github.com/albanD
This unskips 121 tests that the decorator `@skipCUDAIf(_get_torch_cuda_version() < (11, 6))` was unintentionally skipping for ROCm. Other decorators such as `skipCUDAVersionIn` will only activate for the CUDA device, not the CPU or ROCm-as-CUDA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99197
Approved by: https://github.com/ngimel
Using `CUDAGuard` does redundant `set_device` in the following loop:
```C++
{
for (auto& device : devices_) {
at::cuda::CUDAGuard gpuGuard(device); // set device
// ...
// ~gpuGuard() sets original device
}
// ...
}
```
It would be more efficient to use `OptionalCUDAGuard` as follows:
```C++
{
at::cuda::OptionalCUDAGuard gpuGuard;
for (auto& device : devices_) {
gpuGuard.set_index(device.index()); // set device
// ...
}
// ...
// ~gpuGuard() sets original device
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98895
Approved by: https://github.com/mrshenli
This PR introduces CustomOp, a wrapper around a dispatcher operator that allows
users to define custom operators. It adds the skeleton for CustomOp and
some very simple behavior. As of this PR:
- one can create a CustomOp for an operator that does not have inplace or aliasing
- give it CPU/CUDA and Meta implementations
- and trace it into a graph via make_fx.
The design follows
https://docs.google.com/document/d/19Uc5OUCA187q9BZggJb70RT2ZoSTDoG5QQkJkZwd25M/edit
Concretely, we implement the following things mentioned in the doc in this PR:
- Entrypoint 1 (CustomOp.define, creating a new custom operator)
- impl (to define device-specific code) and impl_meta (to define meta
formulas)
The goal for the short term is to get the code to a state where it can be trialed
by the export folks. On top of this PR, the blockers are:
- adding Entrypoint 3 (CustomOp.from_existing)
- adding a way to do data-dependent shape formulas
These will come in future PRs since this one is getting long.
Things that will come in the longer-near-term (before 2.1):
- adding the other entrypoints mentioned in the doc (2 & 3)
- more safety checks and better error messages
- support for views and mutation
- support for defining autograd formulas
- support for functionalization
- making this API public (it's private right now).
Test Plan:
- added a new test case, TestCustomOp. It mostly tests a bunch of error
cases.
- added OpInfos for custom operators and hooked these up to
test_proxy_tensor to test that they work with make_fx. These custom
operators were based off of the ones in the autograd_function_db.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98440
Approved by: https://github.com/ezyang
Fix https://github.com/pytorch/pytorch/issues/63482 and https://github.com/pytorch/pytorch/issues/98691
The above two issues have the same root cause:
**binary_ops** create a TensorIterator with the `promote_inputs_to_common_dtype` flag on, which converts both input tensors to the common_dtype_ (this logic is bypassed on CUDA) and might overflow on Half: if one of the inputs is a scalar with an absolute value larger than ~65000, it will overflow.
This patch tries to fetch the scalar value from `original_tensor_base`, which records the original scalar input value; then, in `cpu_kernel_vec`, the TensorIterator is treated as a unary op.
So previously, CPU and CUDA had different behaviors for this scenario. This patch aligns them; test cases are added for both CPU and CUDA devices.
The following is the results:
#### before:
```
>>> torch.tensor([3388.], dtype=torch.half).div(524288.0)
tensor([0.], dtype=torch.float16)
>>> torch.tensor([0.01], dtype=torch.float16) * torch.tensor(65536, dtype=torch.float32)
tensor([inf], dtype=torch.float16)
```
#### after:
```
>>> torch.tensor([3388.], dtype=torch.half).div(524288.0)
tensor([0.0065], dtype=torch.float16)
>>> torch.tensor([0.01], dtype=torch.float16) * torch.tensor(65536, dtype=torch.float32)
tensor([655.5000], dtype=torch.float16)
```
We also need to update the `RRelu` implementation to use float to store the intermediate results; otherwise the following test case would fail:
```
. build/bin/test_api --gtest_filter=ModulesTest.RReLU
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98820
Approved by: https://github.com/jgong5, https://github.com/ngimel
Mostly `s/@master/@main` in numerous `.yml` files.
Keep `master` in `weekly.yml` as it refers to `xla` repo and in `test_trymerge.py` as it refers to a branch PR originates from.
This diff renames quantization spec/config and operator config. It moves these
data structures to the base quantizer.
The base quantizer API now has get_supported_operators, which returns the list of
patterns that a quantizer quantizes.
There are two choices being debated for how to convey to the user what a particular
quantizer will quantize.
1. Modules. We just convey which nn.Modules will be quantized. Of course that
does not mean that equivalent functional variants won't be quantized, however
for simplicity we just use nn.Module. If certain ops are quantized in a fused
manner then that is considered an internal detail. Pros and cons of this
approach:
Pros:
- Simple. Only nn.Modules are listed.
- The user does not have to see fusion patterns.
Cons:
- Perhaps confusing, because it is not clear whether supported = nn.Conv2d also
means that the quantizer supports functional.conv2d.
- Hiding fusion patterns means the user has no say in not fusing. Meaning, if
conv2d + relu is fused and the user configures only conv to be quantized, the quantizer
will also quantize the following relu as if conv2d + relu were fused.
2. Patterns. Be explicit about what is supported and enumerate all possible
combinations.
Pros:
- It is very clear what the quantizer will do; no surprises.
Cons:
- It is not simple to parse.
- It can be argued that fusion is an internal detail of the quantizer, so some
quantizer implementations may choose to expose fusion patterns, while others
may not and may not even provide any configurability.
One option is to move set_supported_operators/modules out of the base quantizer and
let each quantizer define its own way of communicating what is supported. The issue
with this is that when we want to "compose" multiple quantizers there is no way
for the user to define the order of composition if the user does not know what a
quantizer supports. For example, quantizer A may quantize conv + relu while B
quantizes only conv, but B's implementation is fast. In that case you may compose (B, A)
such that B quantizes conv and A quantizes relu. Not knowing what A
and B support makes such composition harder.
Differential Revision: [D44895547](https://our.internmc.facebook.com/intern/diff/D44895547/)
**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D44895547/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99063
Approved by: https://github.com/jerryzh168
* Introduce a frame counter which lets us uniquely identify frames.
This makes it easier to tell if you are recompiling the same frame
* Shorten evaluate_expr to eval for more visual distinctiveness
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99159
Approved by: https://github.com/Skylion007
Previously, we had a problem when partitioning forward-backward dynamic graphs, which is that we could end up with a backward graph that mentions a symbol in an input tensor (e.g., `f32[s0 + s1]`), but without this symbol being otherwise bound elsewhere. When this happens, we have no way of actually deriving the values of `s0` and `s1`. Our fix for this in https://github.com/pytorch/pytorch/pull/93059 was to just retrace the graph, so that s0 + s1 got allocated a new symbol s2 and everything was happy. However, this strategy had other problems, namely (1) we lost all information from the previous ShapeEnv, including guards and (2) we end up allocating a LOT of fresh new symbols in backwards.
With this change, we preserve the same ShapeEnv between forward and backwards. How do we do this? We simply require that every symbol which may be present inside tensors, ALSO be a plain SymInt input to the graph. This invariant is enforced by Dynamo. Once we have done this, we can straightforwardly modify the partitioner to preserve these SymInt as saved for backwards, if they are needed in the backwards graph to preserve the invariant as well.
This apparently breaks yolov3, but since everything else is OK I'm merging this as obviously good and investigating later.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99089
Approved by: https://github.com/voznesenskym
The strategy is that we will heap allocate a LargeNegativeIntSymNodeImpl whenever we have a large negative int, so that we can keep the old `is_symbolic` test (now called `is_heap_allocated`) on SymInt. Whenever we need to do something with these ints, though, we convert them back into a plain `int64_t` (and then, e.g., wrap it in whatever user-specified SymNodeImpl they need.) We cannot wrap directly in the user-specified SymNodeImpl as we generally do not know what the "tracing context" is from C++. We expect large negative ints to be rare, so we don't apply optimizations like singleton-ifying INT_MIN. Here's the order to review:
* c10/core/SymInt.h and cpp
* `is_symbolic` renamed to `is_heap_allocated` as I needed to audit all use sites: the old `is_symbolic` test would return true for large negative int, but it would be wrong to then try to dispatch on the LargeNegativeIntSymNodeImpl which supports very few operations. In this file, I had to update expect_int,
* If you pass in a large negative integer, we instead heap allocate it in `promote_to_negative`. The function is written in a funny way to keep compact constructor code for SymInt (the heap allocation happens out of line)
* clone is now moved out-of-line
* New method maybe_as_int which will give you a constant int if it is possible, either because it's stored inline or in LargeNegativeIntSymNodeImpl. This is the preferred replacement for previous use of is_symbolic() and then as_int_unchecked().
* Rename toSymNodeImpl to toSymNode, which is more correct (since it returns a SymNode)
* Complete rewrite of `normalize_symints.cpp` to use the new `maybe_as_int`. We cannot easily use the old code structure, so it's now done using a macro and typing out each case manually (it's actually not that bad).
* Reimplementations of all the unary operators by hand to use `maybe_as_int`, relatively simple.
* c10/core/LargeNegativeIntSymNodeImpl.h - Just stores a int64_t value, but it has to be big and negative. Most methods are not implemented, since we will rewrap the large negative int in the real SymNodeImpl subclass before doing operations with it
* The rest of the files are just rewriting code to use `maybe_as_int`. There is a nontrivial comment in c10/core/SymIntArrayRef.h
Very minor test adjustment in c10/test/core/SymInt_test.cpp . Plan to exercise this properly in next PR.
Companion XLA PR: https://github.com/pytorch/xla/pull/4882
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99157
Approved by: https://github.com/albanD
Dynamo benchmark --verbose is broken:
```
Traceback (most recent call last):
File "/scratch/ybliang/work/repos/pytorch/benchmarks/dynamo/torchbench.py", line 400, in <module>
torchbench_main()
File "/scratch/ybliang/work/repos/pytorch/benchmarks/dynamo/torchbench.py", line 396, in torchbench_main
main(TorchBenchmarkRunner(), original_dir)
File "/scratch/ybliang/work/repos/pytorch/benchmarks/dynamo/common.py", line 1967, in main
return maybe_fresh_cache(
File "/scratch/ybliang/work/repos/pytorch/benchmarks/dynamo/common.py", line 993, in inner
return fn(*args, **kwargs)
File "/scratch/ybliang/work/repos/pytorch/benchmarks/dynamo/common.py", line 2135, in run
torch._dynamo.config.log_level = logging.DEBUG
File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/config_utils.py", line 67, in __setattr__
raise AttributeError(f"{self.__name__}.{name} does not exist")
AttributeError: torch._dynamo.config.log_level does not exist
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99224
Approved by: https://github.com/voznesenskym
Before this PR, if users call ```Conv2d(x)```, dynamo handles it well (no graph break) and puts a ```call_module``` op in the FX graph. However, if users explicitly call ```Conv2d.forward(x)``` in another ```forward``` function, the inlining would fail (causing a graph break). This PR fixes the issue by translating the explicit ```Conv2d.forward(x)``` into ```Conv2d(x)```.
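A toy illustration of the pattern (the wrapper module is made up for the example):
```python
import torch
import torch.nn as nn

class Wrapper(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)

    def forward(self, x):
        # Explicit .forward() call: previously this failed to inline and caused
        # a graph break; now dynamo treats it like self.conv(x).
        return self.conv.forward(x)

compiled = torch.compile(Wrapper())
out = compiled(torch.randn(1, 3, 16, 16))
```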
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99015
Approved by: https://github.com/jansel, https://github.com/wconstab
Fixes #99174
## Enable FSDP ``use_orig_params=True`` mixed precision training when some ranks have no (non-zero sized) parameter shards
### The issue
Now that ``use_orig_params=True`` allows non-uniform ``requires_grad`` (🎉🚀 thanks @awgu!!!) with [#98221](https://github.com/pytorch/pytorch/pull/98221), there will be circumstances wherein some ranks have no (non-zero sized) local shards of the original parameters (and hence no associated gradients).
### Use Cases
For a simple Transformer case, imagine a user wraps all encoder layers in separate FSDP instances but allows the classifier head to be wrapped in the same FSDP instance as the relatively large embeddings layers. While this is a sub-optimal wrapping strategy for most use-cases, I believe it is expected to be supported (full precision training works in that context).
I originally encountered this issue while extending a package I maintain, leveraging the relaxed ``requires_grad`` constraint to simplify multi-phase scheduled fine-tuning FSDP configuration, so a [concrete example is there](https://finetuning-scheduler.readthedocs.io/en/latest/advanced/fsdp_scheduled_fine_tuning.html#basic-scheduled-fine-tuning-with-fsdp).
### Reproduction and Remediation
Currently, ``ShardedGradScaler`` does not accommodate these situations, failing to initialize ``optimizer_state["found_inf_per_device"]`` when ``unscale_`` is called.
In this PR, I extend the existing ``ShardedGradScaler`` tests with an ``use_orig_params=True`` dimension added to the parameterization and test scenarios wherein one rank possesses no (non-zero sized) parameter shards.
The relevant issue can be reproduced with the tests I'm adding in this PR. The current (pre-PR) execution of these tests fails in ``use_orig_params=True`` mode with this error:
```python
./test_fsdp_sharded_grad_scaler.py::TestShardedGradScalerParityWithDDP::test_fsdp_ddp_parity_with_grad_scaler_offload_false_none_mixed_precision_use_orig_params Failed with Error: Process 0 exited with error code 10 and exception:
Traceback (most recent call last):
File "/home/speediedan/repos/pytorch/torch/testing/_internal/common_distributed.py", line 657, in run_test
getattr(self, test_name)()
File "/home/speediedan/repos/pytorch/torch/testing/_internal/common_distributed.py", line 543, in wrapper
fn()
File "/home/speediedan/repos/pytorch/torch/testing/_internal/common_utils.py", line 259, in instantiated_test
test(self, **param_kwargs)
File "/home/speediedan/repos/pytorch/torch/testing/_internal/common_distributed.py", line 174, in wrapper
return func(*args, **kwargs)
File "/home/speediedan/repos/pytorch/test/distributed/fsdp/test_fsdp_sharded_grad_scaler.py", line 187, in test_fsdp_ddp_parity_with_grad_scaler
self._test_fsdp_parity(
File "/home/speediedan/repos/pytorch/torch/testing/_internal/common_fsdp.py", line 1152, in _test_fsdp_parity
fsdp_loss = self._train_for_several_steps(
File "/home/speediedan/repos/pytorch/torch/testing/_internal/common_fsdp.py", line 1016, in _train_for_several_steps
sharded_grad_scaler.step(optim)
File "/home/speediedan/repos/pytorch/torch/distributed/fsdp/sharded_grad_scaler.py", line 291, in step
return super().step(optimizer, *args, **kwargs)
File "/home/speediedan/repos/pytorch/torch/cuda/amp/grad_scaler.py", line 368, in step
assert len(optimizer_state["found_inf_per_device"]) > 0, "No inf checks were recorded for this optimizer."
AssertionError: No inf checks were recorded for this optimizer.
```
A few implementation notes/considerations and questions:
1. Rather than just initializing ``per_device_found_inf``, one could disable the grad scaler altogether for the relevant ranks, altering ``unscale_`` to reduce with a subgroup or some rank mask construct to keep the ``all_reduce``s in ``distributed/fsdp/sharded_grad_scaler.py:unscale_()`` from hanging. Given that users may subsequently add parameter groups to an optimizer that would require re-enabling the scaler, and the complexity associated with maintaining a separate mask construct or process subgroup, I thought this implementation was cleaner.
2. I extended ``_train_for_several_steps`` and ``_test_fsdp_parity`` in ``/torch/testing/_internal/common_fsdp.py`` with the ability to configure ``sharded_grad_scaler_kwargs`` for future testing flexibility.
3. Should the user be warned that no parameter shards were associated with a given rank? My initial thought is that this should be considered an implementation detail, part of supporting ``use_orig_params`` with heterogeneous ``requires_grad``, and therefore should be transparently handled by PyTorch. Should a DEBUG level message be added? If so, likely further upstream rather than at the scaler step level.
4. Rather than extend the existing ``ShardedGradScaler`` tests with an ``use_orig_params=True`` dimension added to the parameterization, let me know if you prefer that I instead narrow the scope of the new testing to a single additional test, e.g.:
```python
# from typing import Optional
from typing import Optional, List
# ...
# use_orig_params = ["enable_use_orig_params", None]
use_orig_params: List[Optional[str]] = [None]
# ...
configs = list(itertools.product(cpu_offload_config, sharding_strategy_config, mixed_precision, use_orig_params))
configs.append((CPUOffload(offload_params=False), None, "enable_mixed_precision", "enable_use_orig_params"))
```
Thanks as always to the PyTorch distributed team for your astonishingly impressive and valuable contributions to the open-source ML engineering community!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99175
Approved by: https://github.com/awgu
Allowed modules are stuck into dynamo's fx graph as call_module
nodes, without dynamo doing any tracing of the module. This means
during AOT trace time, hooks will fire during tracing when the
call_module is executed, but the hooks themselves will disappear
after that and not be present in the compiled program.
(worse, if they performed any tensor operations, those would get
traced so you could end up with part of the hook's functionality).
To circumvent this, there are two options for 'allowed modules' with hooks.
1) don't treat them as 'allowed' - trace into them
2) graph-break, so the module is no longer part of the dynamo trace at all
(1) will fail for users that opted into allowed modules because they know
their module has problems being traced by dynamo.
(2) causes graph breaks on common modules such as nn.Linear, just because they
are marked as 'allowed'.
It would help matters if we could differentiate between types of allowed modules
(A) allowed to avoid overheads - used for common ops like nn.Linear
(B) allowed to avoid dynamo graphbreaks caused by unsupported code
Ideally, we'd use method (1) for group (A) and (2) for (B).
For now, graph-break on all cases of allowed modules.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97184
Approved by: https://github.com/jansel
Summary
* Introduce input/output adapter. Due to design differences, input/output format
between PyTorch model and exported ONNX model are often not the same. E.g., `None`
inputs are allowed for PyTorch model, but are not supported by ONNX. Nested constructs
of tensors are allowed for PyTorch model, but only flattened tensors are supported by ONNX,
etc. The new input/output adapter is exported with the model, providing an interface to
automatically convert and validate input/output formats.
* As suggested by #98251,
provide extension for unwrapping user defined python classes for `dynamo.export` based
exporter. Unblock huggingface models.
* Re-wire tests to run through `DynamoExporter` w/ `dynamo_export` api. Kept
`DynamoOptimizeExporter` in the tests for now for coverage of this change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98421
Approved by: https://github.com/justinchuby, https://github.com/titaiwangms, https://github.com/thiagocrepaldi
replicate + trec_shard works if we shard / replicate individually, such as follows:
```
m = TestSparseNN()
shard(m.sparse)
replicate(m.dense)
```
but does not work if users do the following:
```
m = TestSparseNN()
shard(m, sharders=[...])
replicate(m)
```
Many upstream trainers use the latter pattern, as sharding is not done at the individual module level but rather on the overall module, by specifying planners that contain the logic for how to shard different embedding table types.
This diff enables the latter approach (while keeping the former intact), but users need to specify `ignored_modules` to ignore embedding tables in replicate(). This is similar to FSDP (class based and composable) and DDP today.
Differential Revision: [D44899155](https://our.internmc.facebook.com/intern/diff/D44899155/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98890
Approved by: https://github.com/mrshenli, https://github.com/yhcharles
Summary:
This diff registers the Vulkan quantized binary ops (add/sub/mul/div) and adds graph rewrites for quantized add, mul, conv2d and conv2d_relu.
The rewrites for conv2d and conv2d_relu make use of the convert_qconv2d_context introduced in D41595032
Test Plan: export quantized mcs model to vulkan
Reviewed By: SS-JIA
Differential Revision: D44189363
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97468
Approved by: https://github.com/SS-JIA
Summary:
Original commit changeset: ba36f8751adc
Original Phabricator Diff: D44788697
Test Plan: model loading is fine after reverting the diff
Reviewed By: zyan0, sayitmemory
Differential Revision: D44921259
---
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99168
Approved by: https://github.com/izaitsevfb
At present, DDP forward uses `_get_stream` to get a stream, which is a CUDA stream.
If a custom module is already registered with torch, we can use `getattr` to get it and its stream. Then, the custom stream is used to copy the tensor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98723
Approved by: https://github.com/ezyang
Fixes #98974
When `torch.fx.subgraph_rewriter._replace_pattern` is used to remove nodes from a graph, if there are two adjacent matches then after the first removal, the nodes in `InternalMatch.nodes_map` and `placeholder_nodes` become outdated because they contain nodes that were just removed from the graph.
This fix is to update the `match.nodes_map` and `match.placeholder_nodes` using the node changes stored in `match_changed_node`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99039
Approved by: https://github.com/angelayi
Summary: This fixes the case when some of the input tensors were
real tensors and fakified in `validate_and_convert_non_fake_tensors`,
but `flat_arg_fake_tensors` would not contain all the inputs
because it was computed before the fakification. We fix this by
recomputing `flat_arg_fake_tensors` after fakification as well.
Test Plan:
python test/dynamo/test_export.py ExportTests.test_mixed_real_and_fake_inputs
Reviewers: Chillee, voznesenskym
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98769
Approved by: https://github.com/voznesenskym
Summary: Support for BINUNICODE8 is missing, so this adds it, allowing us to support attributes > 4GB. For example, for a very large model, we save the lowered model in the EngineHolder using a string attribute.
Test Plan: buck2 test mode/opt //caffe2/test:jit -- --exact 'caffe2/test:jit - test_save_load_large_string_attribute (jit.test_save_load.TestSaveLoad)'
Differential Revision: D44905770
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99104
Approved by: https://github.com/qihqi
Summary:
Use a decomposed convert to make sure we get an exact match; this means the nodes in resnet are
annotated correctly.
Test Plan:
python test/test_quantization.py TestQuantizePT2EModels.test_resnet18_with_quantizer_api
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98905
Approved by: https://github.com/andrewor14
Currently, aten.expand always expands to the global dimension. Then, it
introduces additional slice and clone ops before running compute on
the expanded tensor with a local tensor.
In this commit, if we detect that the op consumes a SymInt size, it respects
both the local size and the dimension placements from which the SymInt was
extracted.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99058
Approved by: https://github.com/wanchaol
Summary:
Make it a bit easier to run the tests anywhere/avoid skipping the tests by using buffers instead of temporary files.
[Er, still figuring out how the sync tooling works, I'll send this against github once the first diff is landed]
Test Plan: buck2 test
Reviewed By: fluckydog232
Differential Revision: D44818261
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98798
Approved by: https://github.com/ezyang
[perf-compare](https://github.com/pytorch/pytorch/actions/workflows/inductor-perf-compare.yml) has a different structure than that of the nightlies.
For these files, the script now generates:
```
# cuda float32 training performance results
## Geometric mean speedup
huggingface timm_models torchbench
-------- ------------- ------------- ------------
inductor 1.46 1.4 1.17
## Mean compilation time
huggingface timm_models torchbench
-------- ------------- ------------- ------------
inductor 57.85 97.63 60.18
## Peak memory compression ratio
huggingface timm_models torchbench
-------- ------------- ------------- ------------
inductor 1.06 1.01 0.83
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99095
Approved by: https://github.com/ezyang
Common advice we give for handling memory fragmentation issues is to
allocate a big block upfront to reserve memory which will get split up later.
For programs with changing tensor sizes this can be especially helpful to
avoid OOMs that happen the first time we see a new largest input and would
otherwise have to allocate new segments.
However, the issue with allocating a block upfront is that it is nearly impossible
to correctly estimate the size of that block. If too small, space in the block
will run out and the allocator will allocate separate blocks anyway. Too large,
and other non-PyTorch libraries might stop working because they cannot allocate
any memory.
This patch provides the same benefits as using a pre-allocated block but
without having to choose its size upfront. Using the cuMemMap-style APIs,
it adds the ability to expand the last block in a segment when more memory is
needed.
Compared to universally using cudaMallocAsync to avoid fragmentation,
this patch can fix this common fragmentation issue while preserving most
of the existing allocator behavior. This behavior can be enabled and disabled dynamically.
This should allow users to, for instance, allocate long-lived parameters and state in individual buffers,
and put temporary state into the large expandable blocks, further reducing
fragmentation.
See inline comments for information about the implementation and its limitations.
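As a hedged usage note, in released builds this kind of behavior is toggled through the caching-allocator configuration, e.g. the `expandable_segments` key of `PYTORCH_CUDA_ALLOC_CONF`; the exact knob for this PR may differ:
```python
# Opt in before the CUDA caching allocator is initialized, i.e. before any
# CUDA tensor is created in the process.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch
x = torch.randn(1024, 1024, device="cuda")  # may now grow an expandable segment
```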
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96995
Approved by: https://github.com/eellison
Summary:
1. Parts 1~4 add `TORCH_ASSERT_ONLY_METHOD_OPERATORS` to files in the MPS codebase and replace `empty_mps` with `empty`. Also exclude `OperationUtils.h` from the assert, as at this stage we still need `<ATen/ATen.h>` to get CI to pass.
2. Part 5 removes `<ATen/ATen.h>` include in `OperationUtils.h` and adds method operator headers to all mps files.
3. The last one moves `TORCH_ASSERT_ONLY_METHOD_OPERATORS` to the top of files.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99016
Approved by: https://github.com/albanD
Currently, it lives inside run(), but this is too late;
we do a lot of work initializing OutputGraph and those log
messages will show up before "start tracing". This is bad.
Now the start of tracing is InstructionTranslator construction,
which ensures we catch these sites.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98990
Approved by: https://github.com/yanboliang
Support for nonblocking NCCL communicators/fault tolerance/checking which was added in 2.14 as an experimental feature.
Enabled via the environment variable:
```
TORCH_NCCL_USE_COMM_NONBLOCKING=1
```
CC @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95715
Approved by: https://github.com/kwen2501
If `CMAKE_GENERATOR=Visual Studio 16 2019`, the build will fail unless `USE_NINJA=False` is set.
This PR changes the behavior so that if CMAKE_GENERATOR is set and not equal to ninja, Ninja will not be used.
This just makes it easier to set another generator.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98605
Approved by: https://github.com/kit1980
As it takes a ridiculous amount of time to build with complex types on CUDA-11.4.
Build speeds for a single gpu architecture (`sm_80`) on 3Ghz 8275CL Intel Xeon:
- 143 sec to compile for all dtypes using CUDA-11.6
- 351 sec to compile for all dtypes using CUDA-11.4
- 24 sec to compile for only floating dtypes using CUDA-11.6
- 52 sec to compile for only floating dtypes using CUDA-11.4
Tweak code a bit to make it compilable with MSVC, which is having trouble with nested preprocessor directives.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98957
Approved by: https://github.com/r-barnes, https://github.com/ngimel
For the purpose of our Bazel and Meta-internal macros tests, we want
to create a single binary that can verify the different
configurations. CMake would see this file, try to run it, and fail
on Windows, which uses different values.
But we don't care about verifying this in CMake since it's not part of
the build unification effort.
In order to do this, we have to rename the SmallVectorTest to match
the naming convention of the rest of the c10 tests.
Differential Revision: [D44823440](https://our.internmc.facebook.com/intern/diff/D44823440/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98710
Approved by: https://github.com/PaliC
Fixes #97593
A new extension mechanism has been added: when the user registers a new backend, the corresponding methods and attributes can be automatically generated.
Run the following code:
`torch.utils.rename_privateuse1_backend('foo')`
`torch.utils.generate_for_privateuse1_backend()`
Then, get the following methods and attributes.
`torch.Tensor.is_foo`
`torch.Tensor.foo()`
`torch.nn.Module.foo()`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98066
Approved by: https://github.com/albanD
Summary:
Fixed quant_min/quant_max for per-channel quantized weight for the reference quantized module in decomposed mode;
this bug was triggered while onboarding an internal model.
Test Plan:
python test/test_quantization.py TestQuantizeFx.test__convert_to_reference_decomposed_fx_per_channel_quant_module
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98903
Approved by: https://github.com/andrewor14
This fixes a regression introduced by the following PR, which made dynamo graph-break on allowed modules with hooks but has its own problems.
- the #97184 PR makes 'allowed modules' with hooks graph-break, and lazy modules
are allowed (should we just make lazy modules not allowed?)
- graph breaks at lazy modules fail the lazy module unit tests, which assert no graph breaks
- this PR attempts to always 'initialize' lazy modules before tracing/calling into their __call__;
initializing a lazy module should delete all its hooks after firing them once, making
the above issue go away
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98516
Approved by: https://github.com/yanboliang, https://github.com/jansel
Wrapper for users to insert constraints into model code.
The constraints will not be maintained in the graph after tracing through make_fx, so retracing with dynamo/make_fx will not work. This will be supported once torch._assert support is implemented; then we can convert the constrain_range calls to torch._asserts.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98433
Approved by: https://github.com/avikchaudhuri, https://github.com/tugsbayasgalan
The default option of `named_parameters` and `named_modules` is to remove the duplicated parameters and modules. However, in FSDP, we need to know what parameters are shared. As a result, setting `remove_duplicate` to False is required in FSDP. Without setting `remove_duplicate` to False, FSDP won't be able to discover shared weights in some cases (e.g., the shared weights are in the same module or there are shared modules).
Differential Revision: [D44897935](https://our.internmc.facebook.com/intern/diff/D44897935/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98912
Approved by: https://github.com/awgu
This was a leftover from when we had more logic in FakeTensor rather than FakeTensorMode, and it wasn't firing correctly. It also makes more sense for it to be in the other validation function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97186
Approved by: https://github.com/bdhirsh
This PR defers warnings about potentially missing symbols
until we hit a situation where we can find a symbol.
It also hardens some of the logic around addresses that might
be out of the range of known unwind logic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99005
Approved by: https://github.com/tugsbayasgalan
If the model is in eval mode (i.e. model.eval() has been called), run the model in full precision.
Changes:
- Changed _force_full_precision to check self.is_training
- Check for _force_full_precision when casting gradients to reduced dtype
- Small change when accessing _full_prec_param_padded
- tests for class based and fully_shard APIs
Differential Revision: [D43933690](https://our.internmc.facebook.com/intern/diff/D43933690/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97645
Approved by: https://github.com/awgu
Summary:
This PR changes prepare to use a default observer/fq constructor when "target_dtype_info" is not set; this allows users to not initialize all nodes with the default
observer/fq constructor. Note we may still need to annotate intermediate nodes after this PR; there will be a follow-up PR to allow users to only annotate the things they
want to quantize.
Test Plan:
python test/test_quantization.py TestQuantizePT2E
python test/test_quantization.py TestQuantizePT2EModels
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99001
Approved by: https://github.com/kimishpatel, https://github.com/andrewor14
Summary: IR check needs to be recursive to accommodate Tuple[Tensor, Tuple[Tensor]] schema
Test Plan:
Run the repro cmd and make sure it no longer fails
TORCH_SHOW_CPP_STACKTRACES=1 TORCH_LOGS="+dynamo,aot,inductor" buck2 run mode/opt scripts/ml_model_exploration/coffee:defi_local -- --baseline_model_entity_id 421946503 --meta_ids '{"union_meta":422685721}' -g -t -l --model_type mimo_ctr_mbl_feed
Differential Revision: D44809096
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98887
Approved by: https://github.com/wconstab
Summary:
Use a decomposed convert to make sure we get an exact match; this means the nodes in resnet are
annotated correctly.
Test Plan:
python test/test_quantization.py TestQuantizePT2EModels.test_resnet18_with_quantizer_api
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98905
Approved by: https://github.com/andrewor14
Adds a script to get rid of the "merging" label when a job is cancelled.
At the moment this can create a race condition if someone cancels a job and starts a new one, though these cases should be pretty rare, especially when it's from a new merge command.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98967
Approved by: https://github.com/malfet
Summary:
Fixed quant_min/quant_max for per-channel quantized weight for the reference quantized module in decomposed mode;
this bug was triggered while onboarding an internal model.
Test Plan:
python test/test_quantization.py TestQuantizeFx.test__convert_to_reference_decomposed_fx_per_channel_quant_module
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98903
Approved by: https://github.com/andrewor14
**Context**
The existing check to see if an arg is duped is `if dupe_arg_pos != kept_pos:`. However, this incorrectly considers every arg after a true duped arg to also be a duped arg.
Consider `flat_args = [a, b, b, c]`, where indices `1` and `2` are duped.
- `add_dupe_map = {0: 0, 1: 1, 2: 1, 3: 2}`
- For `dupe_arg_pos=2, kept_pos=1`, `2 != 1`, so the check correctly identifies the second `b` to be a duped arg.
- For `dupe_arg_pos=3, kept_pos=2`, `3 != 2`, so the check incorrectly identifies the `c` to be a duped arg.
Indeed, if there were more args like `[a, b, b, c, d, e, ...]`, every arg after the second `b` will be considered a duped arg since its `kept_pos` will always be 1 lower than its `dupe_arg_pos`.
**Overview**
This PR changes `add_dupe_map` to be implemented as a `List[int]`, where the list index implicitly represents the `dupe_arg_pos` and the list element represents the `kept_pos`. We use a list to have stable in-order iteration and because we know the keys to be in `{0, 1, ..., len(flat_args) - 1}`.
With `add_dupe_map` as a list, an arg is a dupe arg exactly when its entry in `add_dupe_map` does not introduce a new, not-yet-seen index in the iteration. One way to check this is to count the number of unique args seen so far and compare against that.
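A standalone sketch of this bookkeeping (not the AOTAutograd code itself):
```python
def build_add_dupe_map(flat_args):
    # index = dupe_arg_pos, value = kept_pos
    kept_pos_of = {}   # id(arg) -> position among unique args
    add_dupe_map = []
    for arg in flat_args:
        if id(arg) not in kept_pos_of:
            kept_pos_of[id(arg)] = len(kept_pos_of)
        add_dupe_map.append(kept_pos_of[id(arg)])
    return add_dupe_map

def dupe_flags(add_dupe_map):
    # An arg is a dupe iff its entry does not introduce a new index.
    unique_seen = 0
    flags = []
    for kept_pos in add_dupe_map:
        if kept_pos < unique_seen:
            flags.append(True)     # refers back to an already-seen arg
        else:
            flags.append(False)    # a new, not-yet-seen arg
            unique_seen += 1
    return flags

a, b, c = object(), object(), object()
m = build_add_dupe_map([a, b, b, c])
print(m)              # [0, 1, 1, 2]
print(dupe_flags(m))  # [False, False, True, False]
```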
This closes https://github.com/pytorch/pytorch/issues/98883, where now the guards change from
```
GUARDS ___guarded_code.valid
and ___check_type_id(L['self'], 93996836333040)
and ___check_obj_id(L['self'], 140119034997536)
and not ___are_deterministic_algorithms_enabled()
and ___check_tensors(L['x'])
and L['self']._buf is L['self']._buf_module._buf
and L['self']._buf_module._buf is L['self']._param
```
to without the final incorrect `L['self']._buf_module._buf is L['self']._param` guard.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98932
Approved by: https://github.com/ezyang
Billing of changes:
* Get rid of `print_guards`; instead, you control this with `TORCH_LOGS=torch.fx.experimental.symbolic_shapes`, debug logging toggles stack traces
* Don't incorrectly report the tracing context frame when we're compiling; we just don't have this info anymore! (TODO: use the saved frames instead). This is via a new TracingContext.clear_frame context manager
* Add TracingContext.extract_stack() which gives you the tracing context stack.
* Add ShapeEnvLoggingAdapter to report which ShapeEnv any given operation is from (this is helpful for debugging situations when there are too many ShapeEnvs floating around)
* Tweak create_symbol log message to also report Source
* Add a debug log whenever duck sizing occurs
* Report an excerpt of both the user and system backtrace whenever a guard is added in INFO mode. I found this is a good balance of "where did the guard come from" without full backtrace verbosity.
Example log output with the new output:
```
[2023-04-12 08:25:49,003] torch.fx.experimental.symbolic_shapes: [INFO] 0: create_env
[2023-04-12 08:25:49,021] torch.fx.experimental.symbolic_shapes: [INFO] 0: create_symbol s0 = 32 for L['x'].size()[0]
[2023-04-12 08:25:50,154] torch.fx.experimental.symbolic_shapes: [INFO] 0: evaluate_expr s0 < 128 [guard added] at w.py:11 in forward2 (_dynamo/variables/tensor.py:476 in evaluate_expr)
[2023-04-12 08:25:52,057] torch.fx.experimental.symbolic_shapes: [INFO] 0: evaluate_expr Eq(Mod(s0, 16), 0) [guard added] (_inductor/codegen/triton.py:77 in is_aligned)
```
from running
```
import torch
import torch._dynamo

def f(x, y):
    return x + y

def forward(x, y):
    return forward2(x, y)

def forward2(x, y):
    if x.size(0) < 128:
        x = x * 2
    else:
        x = x * 3
    r = f(x, y)
    r = r * y
    return r

def woof():
    fn_compiled = torch.compile(forward, dynamic=True)
    x = torch.randn(32, device='cuda')
    y = torch.randn(32, device='cuda')
    print(fn_compiled(x, y))

woof()
```
(To induce the Triton guard, I synthetically reverted https://github.com/pytorch/pytorch/pull/98471)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98941
Approved by: https://github.com/wconstab
It's part of the effort to improve PT2 Export UX. This PR improves the usability of `torch.cond()` by allowing the user to pass a `pred` that is a `ConstantVariable`, as it's not rare to see control flow on a rank or a tensor dim size, which is traced as a `ConstantVariable`.
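A hedged illustration of the kind of predicate this enables (the `cond` entry point has moved between modules across releases, so treat the import as an assumption):
```python
import torch

def true_fn(x):
    return x * 2

def false_fn(x):
    return x * 3

def forward(x):
    # x.dim() is a plain Python int during tracing, so the predicate becomes a
    # ConstantVariable in dynamo rather than a traced tensor.
    return torch.cond(x.dim() > 2, true_fn, false_fn, (x,))

compiled = torch.compile(forward)
print(compiled(torch.randn(2, 3)))     # takes the false branch
print(compiled(torch.randn(2, 3, 4)))  # takes the true branch (recompiles)
```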
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98900
Approved by: https://github.com/jansel
Support for nonblocking NCCL communicators/fault tolerance/checking which was added in 2.14 as an experimental feature.
Enabled via the environment variable:
```
TORCH_NCCL_USE_COMM_NONBLOCKING=1
```
CC @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95715
Approved by: https://github.com/kwen2501
Per the offline discussion, there is no technical reason/limitation to have to register bitwise ops using `TORCH_LIBRARY_IMPL`.
Move the registration to `native_functions.yaml` for easier lookup and a registration pattern consistent with other MPS ops.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98908
Approved by: https://github.com/kulinseth
This was a leftover from when we had more logic in FakeTensor rather than FakeTensorMode, and it wasn't firing correctly. It also makes more sense for it to be in the other validation function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97186
Approved by: https://github.com/bdhirsh
Summary
There is confusion between `_dynamo.skip` and `_dynamo.disable`. This removes the `_dynamo.skip` API; the functionality is still available via `_dynamo.disable(recursive=False)`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98899
Approved by: https://github.com/jansel
Summary -
`disallow_in_graph` is mostly useful for backends. Suppose your backend does not support `torch.abs()`; you can then use `disallow_in_graph` to force a graph break.
The assumption in the above statement is that `disallow_in_graph` is called on an `allowed` callable. `allowed` in Dynamo language refers to a callable that is put as-is in the Dynamo graph.
Therefore, if one uses `disallow_in_graph` on some non-torch, non-allowed function, we want to raise an exception to tell the user that they probably want something else.
* If they want to disable Dynamo - they should use torch._dynamo.disable
* If they want to stop inlining - they should use torch._dynamo.graph_break. However, this is not a decorator, so we would need to provide another API. But the question is - who would want to do this?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98892
Approved by: https://github.com/jansel
`distributed/_tensor/test_dtensor_ops` is still flaky in trunk with a curious timeout issue, for example ce4df4cc59. It seems that the test just hangs without any failure. The root cause is unclear. On the other hand, https://github.com/pytorch/pytorch/issues/98816 might offer a solution for this. Anyway, I'm disabling the test on CPU for now while the investigation is being done.
The test is still being run on CUDA-available runners because it's not flaky there.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98868
Approved by: https://github.com/clee2000
Wrapper for users to insert constraints into model code.
The constraints will not be maintained in the graph after tracing through make_fx, so retracing with dynamo/make_fx will not work. This will be supported once torch._assert support is implemented; then we can convert the constrain_range calls to torch._asserts.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98433
Approved by: https://github.com/avikchaudhuri, https://github.com/tugsbayasgalan
When there are > 15000 polygons, trace_plot starts to get really slow.
So we order the allocations, take the smallest allocations beyond the 15000
limit, and put them into a single summarized polygon.
A slider allows this limit to be adjusted.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98865
Approved by: https://github.com/yf225
Summary:
Replace _dynamo.config with an object instead of module
Current usage patterns of setting and reading fields on config will work
unchanged.
Only changes needed going forward:
1. `import torch._dynamo.config` will not work. However, just doing
`import torch._dynamo` is sufficient to access dynamo config
as `torch._dynamo.config`.
2. Files inside the _dynamo folder need to access config via
`from torch._dynamo.config_util import config` instead of
`from torch._dynamo import config`, because _dynamo/__init__.py
imports some of those files, which would otherwise create a circular import.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96455
Approved by: https://github.com/williamwen42
This PR adds the GraphModuleTransformation class that can be used as the
default transformation after the `train_step()` is traced and expanded. The
current implementation includes:
1. Wrap the input graph module with IterGraphModule. This will enable the further graph optimizations which are all implemented on top of IterGraphModule.
2. Ability to lower the graph module to Inductor. To achieve this goal, `lower_to_inductor()` is implemented.
TODO:
1. `override` and `gm_transformation` have overlapping functionality -- `override.transform` can be used to achieve the same thing as `gm_transformation`. However, the current semantics of `override` is to override and transform partial graphs, while `gm_transformation` is to transform the entire expanded GM. The final UX of `compile()` needs some discussion.
2. The current `lower_to_inductor()` assumes that the entire graph can be lowered to Inductor. This assumption is okay for the integration of graph optimizations but is too restrictive for many models. We should upstream `partial_lowering()`.
Differential Revision: [D44616783](https://our.internmc.facebook.com/intern/diff/D44616783/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98182
Approved by: https://github.com/mrshenli
This diff adds the ability to specify range constraints on dynamic dimensions. (Previously we only supported declaring a dynamic dimension, which gets the default range `[2, sympy.oo]`.)
One point worth calling out: our initial design called for compound expressions like `lower <= dynamic_dim(x, d) <= upper`. However this seems difficult to support, because of a combination of desugaring and overloading semantics for such compound expressions in Python. Rather than silently doing the wrong thing, we explicitly error in this case and recommend users to specify multiple constraints, which is supported.
Differential Revision: [D44847318](https://our.internmc.facebook.com/intern/diff/D44847318/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98779
Approved by: https://github.com/ezyang
This fixes a few failing cases where we fail to compute stride_hint for an indexing expression with ModularIndexing
When can size_hint error out? It shouldn't happen when we are getting regular size hints for expressions where free vars are in ShapeEnv. But this is not the case when we try to recover strides from indexing expressions (which is what stride_hint is for). Suppose you have an indexing expression that looks like
```
289*d0 + ModularIndexing(7399*d1 + d2, 1, 17) + 17*ModularIndexing(7399*d1 + d2, 17, 17) + 46240*ModularIndexing(7399*d1 + d2, 289, 128)
```
and want to understand its stride w.r.t. variable `d1`. Let's ignore for a moment that stride for ModularIndexing is not well defined (it becomes negative around the modulo divisor value); even without that, the way we usually compute stride is to substitute `0` and `1` for `d1` and compute the difference of the indexing expression under those substitutions - this is our stride. But for the expression above, the difference would result in an expression that still has the free variable `d2`, which we don't have a substitution for.
The fix that this PR makes is it expands stride computation to substitute not only `0` and `1` for the variable we are computing a stride for, but also `0` for other variables in the indexing expression (`support_vars`).
Note that computing strides in `stride_hints` is a performance optimization that we use to reorder dimensions or make split decisions for split reductions. If it fails, it's not a hard error - we may incorrectly apply reordering, but it won't affect correctness.
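An illustrative sketch of the substitution trick in plain sympy (a simplified stand-in for the expression above; this is not Inductor's actual `stride_hint` helper):
```python
import sympy

d0, d1, d2 = sympy.symbols("d0 d1 d2")
# simplified stand-in: ModularIndexing(x, 1, 17) behaves like Mod(x, 17)
expr = 289 * d0 + 17 * sympy.Mod(7399 * d1 + d2, 17)

def stride_hint(expr, var, support_vars):
    # substitute 0 for every other free variable so the difference is a number
    zeros = {v: 0 for v in support_vars if v is not var}
    at0 = expr.subs({**zeros, var: 0})
    at1 = expr.subs({**zeros, var: 1})
    return sympy.simplify(at1 - at0)

print(stride_hint(expr, d1, [d0, d1, d2]))  # 68 for this stand-in expression
```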
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98783
Approved by: https://github.com/ezyang, https://github.com/voznesenskym
Summary:
This diff fixes more test failures (T150117218) caused by upgrading the "hypothesis" library to 6.70.1 (D44523679).
# //caffe2/caffe2/python:hypothesis_test
This test generates float numbers and filters out those whose absolute values are less than 1e-2.
It is a known issue of the new version of "hypothesis" that it generates zeros or floats with small absolute values too often:
https://github.com/HypothesisWorks/hypothesis/issues/3603
I'm circumventing this issue by suppressing the health check `filter_too_much`.
# //caffe2/caffe2/quantization/server:resize_nearest_dnnlowp_op_test
All arithmetic should be done in float32 when calculating the reference, since the network being tested uses float32 everywhere.
Mixing float32, float64 or even integers will result in intermediate values in float64.
The different precision may cause off-by-1 errors when converting to integer.
Test Plan:
Run all the tests in both "dev" and "opt" modes:
```
for mode in dev opt; do
buck2 test mode/$mode //caffe2/caffe2/python:hypothesis_test -- --run-disabled
buck2 test mode/$mode //caffe2/caffe2/quantization/server:resize_nearest_dnnlowp_op_test -- --run-disabled
buck2 test mode/$mode //caffe2/caffe2/fb/layers/tests:tum_history_test -- --run-disabled
buck2 test mode/$mode //caffe2/caffe2/fb/dper/layer_models/tests:nn_ops_test -- --run-disabled
buck2 test mode/$mode //caffe2/caffe2/fb/metrics:metrics_test -- --run-disabled
buck2 test mode/$mode //deeplearning/numeric_suite/toolkit/test:net_transform_test -- --run-disabled
buck2 test mode/$mode //f3/type_system:tests -- --run-disabled
done
```
**NOTE:** In the first test (`//caffe2/caffe2/python:hypothesis_test`), the two methods `test_constant_fill_from_tensor` and `test_recurrent` would crash.
But these crash on hypothesis 5.49.0, too, so I'm leaving them alone.
Differential Revision: D44812706
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98685
Approved by: https://github.com/malfet
### Overview
This PR de-duplicates graph inputs in TorchDynamo, using the `Source` as the unique identifier for each input. This closes https://github.com/pytorch/pytorch/issues/98743 and https://github.com/pytorch/pytorch/issues/98625.
### Details
`VariableBuilder.wrap_tensor()` should return a `VariableTracker` for the passed-in `value: Tensor`. If `value` is duplicated, we should avoid calling `OutputGraph.create_graph_input()` and `OutputGraph.add_grapharg()`.
- Note that `create_graph_input()` and `add_grapharg()` are not 1:1. For a constant source and either `wrap_sym()` or `wrap_unspecialized_primitive()`, TorchDynamo still calls `create_graph_input()` but not `add_grapharg()`.
- Note that `create_graph_input()` should be called before constructing the corresponding `VariableTracker`. TorchDynamo needs the `fx.Proxy` object to pass to `wrap_fx_proxy()`.
In this PR, the `OutputGraph` saves an additional mapping `input_source_to_var` from each graph input's `Source` to its `VariableTracker`, which works because `Source` is now hashable. This mapping should be updated each time `create_graph_input()` is called. However, since we must construct the `VariableTracker` after `create_graph_input()` returns, we must have a separate call to the `OutputGraph` to update the mapping.
If anyone has any suggestion on how to coalesce this logic and avoid having to remember to update `input_source_to_var` for each `create_graph_input()`, I would love to hear it.
<details>
<summary> Alternate Approach</summary>
Initially, I tried having TorchDynamo construct a new but equivalent `VariableTracker` for the duplicated tensor. However, I abandoned this approach after hitting an assertion in `def wrap_fx_proxy_cls()` due to `"example_value"` already being in the proxy node's metadata because we were reusing the primary tensor's `Proxy` object. Reusing the exact `VariableTracker` also seems less error-prone instead of requiring constructing a new but identical `VariableTracker`.
</details>
### Testing
#### Global Variable Test
```
import torch
@torch.compile()
def f():
    return x + x
x = torch.randn(3)
f()
```
Before:
```
====== Forward graph 0 ======
<eval_with_key>.6 class <lambda>(torch.nn.Module):
    def forward(self, arg0_1: f32[3], arg1_1: f32[3]):
        # File: /data/users/ezyang/b/pytorch/ff.py:5, code: return x + x
        add: f32[3] = torch.ops.aten.add.Tensor(arg0_1, arg1_1); arg0_1 = arg1_1 = None
        return (add,)
```
After (only `arg0_1` and no more `arg1_1`):
```
====== Forward graph 0 ======
<eval_with_key>.4 class <lambda>(torch.nn.Module):
    def forward(self, arg0_1: f32[3]):
        # File: dynamo/test_dup_global.py:8, code: return x + x
        add: f32[3] = torch.ops.aten.add.Tensor(arg0_1, arg0_1); arg0_1 = None
        return (add,)
```
#### FSDP Test
Before we error on
```
File "/.../pytorch/torch/_guards.py", line 244, in __post_init__
assert self.input_source_a != self.input_source_b
```
and now there is no error.
---
The rename from `name_to_input` to `input_name_to_proxy` is not part of the core logic change and is a remnant from initial attempts. I can undo it later if desired, but I also feel that the new name is more informative. It also fixes the type annotation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98775
Approved by: https://github.com/ezyang, https://github.com/voznesenskym
The higher-order derivative calculations of `max_pool2d` require the indices to be provided, but the `mps_max_pool2d` kernel doesn't calculate them. Calculating the indices afterwards during backpropagation would be expensive and unnecessary, since users can directly call `max_pool2d` with `return_indices=True`, which calculates `indices` alongside the output.
This PR adds a warning for it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98582
Approved by: https://github.com/soulitzer
The comment is quite confusing: given the use of `sizeof()`, this was never backward compatible, as the state is not the same size as it used to be.
Running this through CI right now. If it turns out we serialize some rng_state Tensor, I will update the set function to be BC.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98787
Approved by: https://github.com/ngimel
This is a quick fix/hack to get around the issue that some
"global" tensor view operation is invalid, but somehow it gets
triggered by some models even though the mini-batch input itself doesn't have this
issue.
Since we should ultimately remove the dtensor expansion and use the new
expansion, this hack is only a temporary unblock.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98813
Approved by: https://github.com/yifuwang, https://github.com/mrshenli
Summary:
Update _store_based_barrier from using add(), which overloads rank 0 with requests, to issuing a single request every 10 seconds to handle the last joined worker.
Also added an optional logging_interval arg to _store_based_barrier.
Test Plan:
```
pytest test/distributed/test_c10d_common.py -vsk test_store_based_barrier
```
Reviewed By: rohan-varma
Differential Revision: D44430531
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98000
Approved by: https://github.com/kumpera
Summary: Add new experimental python op (`torch.nonzero_static`) for export. There is NO cuda impl included in this PR
Example:
Say input tensor is `x = torch.tensor([[1, 0], [3, 2]])`
calling regular `nonzero()` on x will give you a tensor `tensor([[0, 0], [1, 0], [1, 1]])`
calling `nonzero_static(x, size=4)` on x will give you a tensor `tensor([[0, 0], [1, 0], [1, 1], [fill_value, fill_value]])` (padded)
calling `nonzero_static(x, size=2)` on x will give you a tensor `tensor([[0, 0], [1, 0]])` (truncated)
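A runnable version of the example above (assuming the op is exposed as `torch.nonzero_static` with a default `fill_value` of -1; only the CPU implementation exists per this PR):
```python
import torch

x = torch.tensor([[1, 0], [3, 2]])
print(torch.nonzero(x))                 # tensor([[0, 0], [1, 0], [1, 1]])
print(torch.nonzero_static(x, size=4))  # padded with fill_value rows
print(torch.nonzero_static(x, size=2))  # truncated to the first two indices
```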
Test Plan:
**Unit Tests**
```
buck test @mode/dev-nosan //caffe2/test:test_dynamo -- 'caffe2/test:test_dynamo - test_export.py::ExportTests::test_export_with_nonzero_static' -- 'caffe2/test:test_dynamo - test_misc.py::MiscTests::test_nonzero_static'
```
**PT2 Export with `nonzero_static()`**
Example of `GraphModule` in the exported graph
```
def forward(self, x):
arg0, = fx_pytree.tree_flatten_spec(([x], {}), self._in_spec)
nonzero_static_default = torch.ops.aten.nonzero_static.default(arg0, size = 4); arg0 = None
return pytree.tree_unflatten([nonzero_static_default], self._out_spec)
```
Differential Revision: D44324808
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97417
Approved by: https://github.com/ezyang
Summary: When using real tensors for DTensor propagation, functionalized _fuse_adam causes a memory spike of size(params + optim_state), which causes OOM on memory constrained environments.
Test Plan: Tested manually.
Differential Revision: D44845043
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98789
Approved by: https://github.com/mrshenli
Summary: The non-transformed graph module contains functionalized optimizer which, in a memory constraint environment, needs to be defunctionalized (via fx transformation or lowering to Inductor) before running the first iteration. Otherwise OOM may occur.
Test Plan: Manually tested.
Reviewed By: mrshenli
Differential Revision: D44843942
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98788
Approved by: https://github.com/mrshenli
Summary:
https://www.internalfb.com/logview/details/instagram_ios_crashes/d5fd49a99f3ee21a82b66861de797711
CoreML is crashing in torch::jit::mobile::coreml::CoreMLBackend::compile(c10::IValue, c10::Dict<c10::IValue, c10::IValue>) (PTMCoreMLBackend.mm<175>)
This is related to the crash here https://www.internalfb.com/logview/details/instagram_ios_crashes/a8a317c8da13cd577529e1763364f496/?trace_key=8002f84f5ea00ac68b0dfb91878c754a&selected-logview-tab=shared
kimishpatel's original fix (D44386623) passed modelID by value instead of by reference; however, I believe it just moved the error to the loadModel invocation.
The modelID captured at the loadModel invocation is a reference to the string within the preprocessed IValue payload. When the payload is deallocated, modelID is no longer valid, but the dispatched thread still tries to use it, causing the error.
Test Plan:
```
Running with tpx session id: 2a77b7b1-7594-4479-8ac3-c01db29cf5cc
Trace available for this run at /tmp/tpx-20230407-173155.849234-2a77b7b1-7594-4479-8ac3-c01db29cf5cc/trace.log
RemoteExecution session id: reSessionID-2a77b7b1-7594-4479-8ac3-c01db29cf5cc-tpx
I0407 17:31:55.970502 780835 ConfigeratorDomainConfigs.cpp:177] Notify user with updated size: 92 removed size: 0
Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/1970325002807752
✓ ListingSuccess: //fbobjc/Apps/Internal/PyTorchPlayground:PyTorchPlaygroundTests : 13 tests discovered (0.177)
✓ Pass: //fbobjc/Apps/Internal/PyTorchPlayground:PyTorchPlaygroundTests - PyTorchBITests/testBITextModel (0.028)
✓ Pass: //fbobjc/Apps/Internal/PyTorchPlayground:PyTorchPlaygroundTests - PyTorchBITests/testBIXRayModel (0.167)
✓ Pass: //fbobjc/Apps/Internal/PyTorchPlayground:PyTorchPlaygroundTests - PyTorchCPUBlasTests/testGemmComplexDouble (0.001)
✓ Pass: //fbobjc/Apps/Internal/PyTorchPlayground:PyTorchPlaygroundTests - PyTorchCPUBlasTests/testGemmComplexFloat (0.001)
✓ Pass: //fbobjc/Apps/Internal/PyTorchPlayground:PyTorchPlaygroundTests - PyTorchCPUBlasTests/testGemmDouble (0.001)
✓ Pass: //fbobjc/Apps/Internal/PyTorchPlayground:PyTorchPlaygroundTests - PyTorchCPUBlasTests/testGemmFloat (0.001)
✓ Pass: //fbobjc/Apps/Internal/PyTorchPlayground:PyTorchPlaygroundTests - PyTorchCoreMLTests/testGanModel (0.303)
✓ Pass: //fbobjc/Apps/Internal/PyTorchPlayground:PyTorchPlaygroundTests - PyTorchCoreMLTests/testMCSModel (0.395)
✓ Pass: //fbobjc/Apps/Internal/PyTorchPlayground:PyTorchPlaygroundTests - PyTorchCoreMLTests/testMCSModelInvalidInputShape (0.305)
✓ Pass: //fbobjc/Apps/Internal/PyTorchPlayground:PyTorchPlaygroundTests - PyTorchCoreMLTests/testXirpModel (0.110)
✓ Pass: //fbobjc/Apps/Internal/PyTorchPlayground:PyTorchPlaygroundTests - PyTorchDynamicPyTorchTests/testDynamicPytorchFamFlDictModel (0.014)
✓ Pass: //fbobjc/Apps/Internal/PyTorchPlayground:PyTorchPlaygroundTests - PyTorchDynamicPyTorchTests/testDynamicPytorchFamFlModel (0.005)
✓ Pass: //fbobjc/Apps/Internal/PyTorchPlayground:PyTorchPlaygroundTests - PyTorchDynamicPyTorchTests/testDynamicPyTorchXirpModel (0.065)
✓ Pass: //fbobjc/Apps/Internal/PyTorchPlayground:PyTorchPlaygroundTests - main (13.177)
```
Differential Revision: D44808433
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98655
Approved by: https://github.com/SS-JIA, https://github.com/tiandiao123, https://github.com/kirklandsign
Summary:
See [this post](https://fb.workplace.com/groups/devinfra.capacity.eng/permalink/1200060064273920/) for context and specifically [this solution](https://fb.workplace.com/groups/devinfra.capacity.eng/posts/1200060064273920/?comment_id=1200166060929987&reply_comment_id=1200177124262214) which this diff implements.
The gist is that updating a `bzl` file is *very* expensive for diff-time testing and triggers many flaky tests when attempting to land a model update from EdgeML. The purpose of these bzl files (from what I can tell) is to unit test models via a CXX resources map. Since it's only used for CXX resource generation, this can be accomplished by generating a `fb_xplat_cxx_library` BUCK target instead. This required shuffling around some existing BUCK files due to buck rules around file ownership.
Since the EdgeML process already generates code to begin with, this is straightforward: change the code to generate a BUCK file instead of bzl files, change the existing targets to use it, and then delete the old bzl files.
Test Plan:
Run the model gen script.
```
buck2 run mode/opt caffe2/torch/fb/mobile/cli:cli -- --concat_all_model_configs
```
Sanity test the new BUCK target.
```
buck2 build xplat/pytorch_models/build:test_resources
```
Run the model unit tests and confirm they still work.
```
buck2 run xplat/caffe2:for_each_prod_ptl_model_test
```
CI/CD for the rest.
I expect some flaky tests given the `bzl` file deletion, which triggers a ton of unrelated tests.
Differential Revision: D44699671
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98450
Approved by: https://github.com/JacobSzwejbka
Summary: IL generates massive function names, which means the pickle opcode used is BINUNICODE instead of the short version -- and then it would silently get skipped while pickling with protocol 4.
Differential Revision: D44815351
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98674
Approved by: https://github.com/ezyang
As the comment for `get_expanded_dims` says:
```
# copy_ fails when trying to write to tensors with memory overlap,
# for expanded dimensions (a dimension which used to have size 1 -> ?)
# we can select one element from that dimension and write to it
# to achieve writing to all values of that dimension of the input tensor
```
We were doing this for the copy, but not for checking whether we could copy. Update it so we index first and then check for memory overlap. This covers all of the `complex_striding` warnings I observed in TB.
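A hedged, self-contained illustration of the trick described in that comment (plain PyTorch, not the Inductor helper itself):
```python
import torch

base = torch.zeros(1, 4)
expanded = base.expand(3, 4)   # dim 0 is expanded: stride 0, memory overlap

# expanded.copy_(torch.arange(12.).view(3, 4)) would fail the overlap check,
# so select a single element of the expanded dimension and write through it.
expanded.select(0, 0).copy_(torch.arange(4.))
print(expanded)                # every row reflects the single write
```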
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98656
Approved by: https://github.com/ngimel, https://github.com/yf225
Extend the RNG device-related functions to support custom device extensions; the default device is `cuda`.
@bdhirsh @kit1980 would you please take a moment to review my changes?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98069
Approved by: https://github.com/bdhirsh
In C++ we have TORCH_LIBRARY_FRAGMENT. This PR adds the same
functionality to the Python torch.library API.
The motivation for this is: for the simple custom op API, we don't want
users to need to deal with Library objects. One way to hide this from
users is to create library fragments.
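A hedged sketch of what the Python-side fragment functionality enables (the namespace `mylib` and the op schemas below are made up for illustration):
```python
import torch
from torch.library import Library

# Two independent fragments can def + impl operators in the same namespace.
frag_a = Library("mylib", "FRAGMENT")
frag_a.define("add_one(Tensor x) -> Tensor")
frag_a.impl("add_one", lambda x: x + 1, "CPU")

frag_b = Library("mylib", "FRAGMENT")
frag_b.define("mul_two(Tensor x) -> Tensor")
frag_b.impl("mul_two", lambda x: x * 2, "CPU")

print(torch.ops.mylib.add_one(torch.ones(2)))  # tensor([2., 2.])
print(torch.ops.mylib.mul_two(torch.ones(2)))  # tensor([2., 2.])
```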
Test Plan:
- tests that you can create multiple fragments and def+impl operators on each.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98439
Approved by: https://github.com/ezyang, https://github.com/bdhirsh
For the current runtime wrapper in AOT, `disable_amp` is always set to True. In fact, we would like to avoid disabling autocast if possible, because accessing TLS is slow. In this PR, `disable_amp` depends on whether any autocast mode is enabled instead of always being True. Many operators get a performance improvement (inductor vs. eager) with this fix.
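A hedged sketch of the idea (not the actual AOTAutograd runtime wrapper code; the helper name and the CUDA-only toggle are assumptions for illustration):
```python
import contextlib
import torch

def maybe_disable_autocast():
    # Only pay the TLS cost of toggling autocast when some autocast mode is on.
    if torch.is_autocast_enabled() or torch.is_autocast_cpu_enabled():
        return torch.autocast("cuda", enabled=False)
    return contextlib.nullcontext()

with maybe_disable_autocast():
    pass  # run the compiled region here
```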
Examples of operators with ~0.8x speedup in torchbench (inductor vs. eager):
|  | current | new |
| -- | -- | -- |
| aten.hardsigmoid.default | 0.709372349 | 0.81414306 |
| aten.tanh.default | 0.715227805 | 0.855556349 |
| aten.add.Scalar | 0.682292123 | 0.860371222 |
| aten.sigmoid_backward.default | 0.688039934 | 0.915606579 |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97864
Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/jgong5, https://github.com/bdhirsh
There were some recent failures on master, and I think it's fair to defer on turning it on till we get a bit of the Tensor construction overhead down because that shows up a lot in the TB benchmarks.
There may ultimately be an unavoidable tradeoff between memory and performance to some extent but we can get the overhead numbers down a bit first.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98709
Approved by: https://github.com/Chillee
Use float32 as the accumulation type for `min`, `max` and `minmax`: in the function `vec::reduce_all`, float16 inputs will be accumulated in float32.
The performance benefit basically comes from the vectorization of `Half` https://github.com/pytorch/pytorch/pull/96076
Tested on Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
**single socket**
```
(before)
### using OMP_NUM_THREADS=20
### using numactl --physcpubind=0-19 --membind=0
max: size: torch.Size([64, 128, 1024]) 2.071 ms
(after)
### using OMP_NUM_THREADS=20
### using numactl --physcpubind=0-19 --membind=0
max: size: torch.Size([64, 128, 1024]) 0.071 ms
```
**single core**
```
(before)
### using OMP_NUM_THREADS=1
### using numactl --physcpubind=0 --membind=0
max: size: torch.Size([64, 128, 1024]) 33.488 ms
(after)
### using OMP_NUM_THREADS=1
### using numactl --physcpubind=0 --membind=0
max: size: torch.Size([64, 128, 1024]) 0.953 ms
```
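A rough, hedged way to reproduce the measurement shape above (the timing method here is an assumption, not the script that produced those numbers):
```python
import timeit
import torch

x = torch.randn(64, 128, 1024, dtype=torch.half)
t = timeit.timeit(lambda: torch.max(x), number=100) / 100
print(f"max: size: {tuple(x.shape)} {t * 1e3:.3f} ms")
```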
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96079
Approved by: https://github.com/jgong5, https://github.com/kit1980
Patterns based on https://github.com/pytorch/pytorch/pull/94729 mainly as a forcing function for implementing joint graph replacements.
Up until now, we had two places to do pattern matching
1) Pre-grad has janky infra (graph not normalized or functional), but is
desirable for many types of passes where you want your change to
affect grad formulas.
2) Post-grad has good infra, but can't change grad formulas.
This PR adds a third place to do pattern matching: the joint
forward+backwards graph. The idea is to take the patterns and lower
them to a joint graph and replace both the forwards+backwards before
we partition them. This allows us to do something similar to pre-grad
transforms, but run after normalization and functionalization.
Note that we don't seem to have kernels for all of these patterns, some get decomposed in the dispatcher.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97741
Approved by: https://github.com/Chillee
As in Python-3.9+ `Dict`, `List`, and `Tuple` from `typing` module are deprecated in favor of their `builtins` counterparts, see [PEP 585](https://peps.python.org/pep-0585/)
Test plan: Run:
```
import torch
from typing import Union
@torch.jit.script
def to_tuple(v: Union[int, tuple[int, int]]) -> tuple[int, int]:
    """Converts int or tuple to tuple of ints."""
    if torch.jit.isinstance(v, int):
        return v, v
    else:
        return v

print(to_tuple(1), to_tuple((3, 4)))
```
It's almost impossible to add a test to the existing CI, as the test script will not be parseable by Python 3.8, which is the oldest supported Python version.
Fixes https://github.com/pytorch/pytorch/issues/98521
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98703
Approved by: https://github.com/kit1980
Fixes #97728, fixes #98622
Fixes https://github.com/microsoft/onnx-script/issues/393
Provide op_level_debug in the exporter, which creates randomized torch.Tensors based on FakeTensorProp real shapes as inputs to both torch ops and ONNX symbolic functions. The PR leverages the Transformer class to create a new fx.Graph, but shares the same Module with the original one to save memory.
The test is different from [op_correctness_test.py](https://github.com/microsoft/onnx-script/blob/main/onnxscript/tests/function_libs/torch_aten/ops_correctness_test.py) in that op_level_debug generates real tensors based on the fake tensors in the model.
Limitations:
1. Some of the trace_only functions are not supported due to a lack of param_schema, which leads to args/kwargs being wrongly split and ndarray wrapping. (WARNINGS in SARIF)
2. Ops with dim/indices (INT64) are not supported, as they need information (shape) from other input args. (WARNINGS in SARIF)
3. sym_size and built-in ops are not supported.
4. op_level_debug only labels results in SARIF. It doesn't stop the exporter.
5. Introduce an ONNX-owned FakeTensorProp that supports int/float/bool.
6. Parametrize op_level_debug and dynamic_shapes in FX tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97494
Approved by: https://github.com/justinchuby, https://github.com/BowenBao
Always do vectorization with scalar fallback for indirect indexing right now. We can vectorize the indirect indexing load/store by analyzing how the indirect indices are related to the loop variables. This will be done in future PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98138
Approved by: https://github.com/jansel
This PR primarily made two changes:
1. Support all ops (not only the load related ops) for `ops.masked`. Do recursive checks on masked body in `CppVecKernelChecker`. With this, we can remove `is_load_only_block` function and corresponding checking logic in `masked`.
2. Change the loop steps to the vectorized scaling factor instead of scaling the vectorized loop variables. With this, we can remove all the code that scales the loop variables explicitly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98135
Approved by: https://github.com/EikanWang, https://github.com/jansel
This changes `TritonKernel` to have an `index_dtype` property which is
used as the dtype in indexing calculations. By default it is
`tl.int32` but if any input or output buffer is larger than `INT_MAX`
then we use `tl.int64` instead.
should fix #96978 and #93606 (need to double-check)
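A hedged sketch of the selection rule described above (illustrative Python, not the actual `TritonKernel` code; whether sizes are measured in elements or bytes is an assumption here):
```python
INT_MAX = 2**31 - 1

def choose_index_dtype(buffer_numels):
    # buffer_numels: element counts of all input/output buffers of the kernel
    return "tl.int64" if any(n > INT_MAX for n in buffer_numels) else "tl.int32"

print(choose_index_dtype([4096, 1 << 20]))   # tl.int32
print(choose_index_dtype([4096, 1 << 32]))   # tl.int64
```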
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97447
Approved by: https://github.com/ngimel
The python function `benchmark_compiled_module` ends up using the C++ expression printer to print the size for `rand_strided`, so you get a set, e.g. `{2, 17}`, instead of a
tuple `(2, 17)`. Here is a complete example from master:
```python
def benchmark_compiled_module(times=10, repeat=10):
    from torch._dynamo.testing import rand_strided
    from torch._inductor.utils import print_performance
    arg0_1 = rand_strided({2, 17}, {17, 1}, device='cpu', dtype=torch.float32)
    arg1_1 = rand_strided({2, 17}, {17, 1}, device='cpu', dtype=torch.uint8)
    return print_performance(lambda: call([arg0_1, arg1_1]), times=times, repeat=repeat)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98608
Approved by: https://github.com/ngimel
Summary:
This PR adds support for adaptive_avg_pool2d (traced as mean.dim), mean and hardtanh to QNNPackQuantizer
Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_qnnpack_quantizer_obs_sharing_ops
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98560
Approved by: https://github.com/andrewor14
This PR makes basic nnmodule forward hooks work by default, without any overhead (the basic pattern is sketched after the list below). But it leaves silent correctness issues if users modify/remove their hooks later, so it also emits a warning.
- the usual case is to not use hooks, so avoid guard overhead here
- registering any hook before compile will trigger a warning about hook support
- registering a hook later (or removing one) requires user knowledge and opting in,
currently this isn't warnable (but maybe we can observe compiled nnmodules to make it
warnable).
Why skip hook guards by default instead of not tracing __call__/hooks by default?
- avoid having a mode flag that alters dynamo tracing behavior (harder to test both codepaths
in CI with full coverage)
- the most basic hook usecase (registering a hook before compile, and never removing it)
will work by default with this PR, while it would require enablement and incur overhead
in the 'not tracing __call__' proposal.
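A hedged sketch of the basic supported pattern (hook registered before compile and never removed; the `"eager"` backend is just for illustration):
```python
import torch

mod = torch.nn.Linear(4, 4)

def double_output(module, args, output):
    return output * 2

mod.register_forward_hook(double_output)   # registered before compiling
opt_mod = torch.compile(mod, backend="eager")
print(opt_mod(torch.randn(2, 4)))          # hook runs as part of the compiled call
```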
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98371
Approved by: https://github.com/jansel
Small QoL improvement so that add_numbered_label now works more intuitively. Now, if we push different labels, instead of getting `[reverted, mergedX2, revertX3, mergedX4, revertedX5, mergedX6]` we get `[reverted, merged, revertX2, mergedX2, revertedX3, mergedX3]`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98551
Approved by: https://github.com/huydhn
Significantly reduces the overhead of constructing Tensors and Storages and checking Storage liveness. Removes the regression for the HF models that I tested and removes 75% of the overhead of the extremely overhead-bound resnet50 training we have in torchbench (0.91x base commit, 1.02x torchinductor default, 1.16x this PR, 1.25x previous cudagraphs impl).
This PR takes care of all of the lower hanging fruit.
- Computes storage aliasing at record time instead of during at runtime. We no longer need to use a runtime storage cache, and can instead index directly into the existing alias if there is one, or construct a new Storage
- Moves the heavyweight C++ calls into a batch - getting storage weakrefs and constructing tensors
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98529
Approved by: https://github.com/jansel, https://github.com/ngimel
Summary:
This PR adds annotation support for conv2d relu, linear, maxpool2d, add and add relu so
that we can successfully quantize resnet18 with the prepare_pt2e_quantizer API and get the same result
as fx graph mode quantization
Test Plan:
python test/test_quantization.py TestQuantizePT2EModels.test_resnet18_with_quantizer_api
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98507
Approved by: https://github.com/vkuzo
Add a PrivateUse1 folder to contain all the feature adaptations for PrivateUse1 under ATen, for example GetGeneratorPrivate, which is used for a third-party backend to register its own Generator implementation. This makes it easier for us to centrally manage these features, and it will increase the convenience of adaptation for different backend manufacturers. For more info: https://github.com/pytorch/pytorch/issues/98073
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98127
Approved by: https://github.com/bdhirsh
This is yet another wrong shard number calculation on ASAN causing flakiness. I figure that we don't really need to run this test on ASAN, so let's disable it. There is a discussion at the moment to run ASAN periodically too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98544
Approved by: https://github.com/malfet
Summary: This is a reland of #98264.
When _inductor.config.cpp_wrapper is specified, we run a
two-pass wrapper codegen to generate wrapper code in cpp which calls
cuLaunchKernel to launch pre-compiled cuda kernels, and then call
load_inline to load that generated wrapper back into the python world.
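A hedged usage sketch of the flag this PR relands (per the description it targets pre-compiled CUDA kernels, so a CUDA device is assumed here):
```python
import torch
import torch._inductor.config as inductor_config

inductor_config.cpp_wrapper = True  # opt into the cpp wrapper codegen path

@torch.compile
def f(x):
    return torch.relu(x) + 1

print(f(torch.randn(8, device="cuda")))
```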
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98534
Approved by: https://github.com/huydhn
Fixes #98149
The type of `mul`'s output is not consistent with its input. This PR fixes the type of `mul`'s output.
Here is the output code for the newly added test case `pow+cos`. `tmp4` is 1024 before fixing and 0 after fixing.
#### Before fixing
```
auto tmp0 = in_ptr0[static_cast<long>(0)]; // tmp0 is unsigned_char
auto tmp1 = tmp0 * tmp0; // tmp1 is int
auto tmp2 = tmp1 * tmp1; // tmp2 is int
auto tmp3 = tmp2 * tmp0; // tmp3 is int
auto tmp4 = static_cast<float>(tmp3); // tmp4 is float
auto tmp5 = std::cos(tmp4);
out_ptr0[static_cast<long>(0)] = tmp5;
```
#### After fixing
```
auto tmp0 = in_ptr0[static_cast<long>(0)]; // tmp0 is unsigned_char
auto tmp1 = decltype(tmp0)(tmp0 * tmp0); // tmp1 is unsigned_char
auto tmp2 = decltype(tmp1)(tmp1 * tmp1); // tmp2 is unsigned_char
auto tmp3 = decltype(tmp2)(tmp2 * tmp0); // tmp3 is unsigned_char
auto tmp4 = static_cast<float>(tmp3); // tmp4 is float
auto tmp5 = std::cos(tmp4);
out_ptr0[static_cast<long>(0)] = tmp5;
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98473
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/jansel
This PR explicitly add $CONDA_ENV/bin to MacOS PATH, so that it can always detect and use the correct Python. $CONDA_ENV is always set to the correct value in setup-miniconda https://github.com/pytorch/test-infra/blob/main/.github/actions/setup-miniconda/action.yml#L141
### 🤖 Generated by Copilot at b4de81a
This pull request fixes the conda-pip environment mismatch for the macOS build and test workflows by using consistent pip requirements files. It also adds a conditional block to the `.github/workflows/_mac-test-mps.yml` file to enable the test MPS job.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98522
Approved by: https://github.com/malfet
Pattern replacement behaves incorrectly when the replacement pattern maps inputs to outputs (such a pattern can be used to replace redundant code). However, current code in `torch.fx.subgraph_rewriter._replace_pattern` causes the list of replacement nodes to include the entire graph before that node, resulting in an exponential slowdown due to recursive calls traversing the entire graph multiple times.
The proposed fix is to add a check in `_replace_pattern` to prevent the call to `get_replacement_nodes`:
```python
for ret_node in copied_returning_nodes:
    if ret_node in match.placeholder_nodes:
        replacement_nodes.append(ret_node)
    else:
        get_replacement_nodes(ret_node)
```
Fixes #97817
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97903
Approved by: https://github.com/angelayi
This PR addresses the issue seen in PR #97417, where the newly added op requires `kwargs`; however, tools/autograd/gen_annotated_fn_args.py currently does not support `kwargs` - only `func_args` are generated for test_overrides.py.
The PR adds a new field, "is_kwarg_only", to each argument indicating whether it's a keyword-only argument or not. See example:
```
annotated_args = {
torch._C._VariableFunctions._cast_Byte: [{'is_kwarg_only': 'False', 'name': 'self', 'simple_type': 'Tensor'}],
...
```
The full comparison of the generated file `annotated_fn_args.py` can be found here:
- **Before**: [P681991116](https://www.internalfb.com/phabricator/paste/view/P681991116)
- **After**: [P681994218](https://www.internalfb.com/intern/paste/P681994218/)
Differential Revision: D44698310
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98396
Approved by: https://github.com/ezyang
The meta implementation for these _like functions is wrong whenever device != "meta" (it doesn't fill the memory!).
zeros_like is special due to sparse and is fixed directly by always filling it with zeros.
Every other one is a CompositeExplicit implementation; I went with removing their meta registration and tweaking code to avoid infinite recursions.
I could do the same as zeros_like (and add the proper filling for each), but that would duplicate the C++ logic and make the meta registrations non-trivial. I can do it if you prefer that to removal.
test_meta works fine with these fixes; relying on CI to see if other tests are breaking as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98160
Approved by: https://github.com/ezyang
Summary: when _inductor.config.cpp_wrapper is specified, we run a
two-pass wrapper codegen to generate wrapper code in cpp which calls
cuLaunchKernel to launch pre-compiled cuda kernels, and then call
load_inline to load that generated wrapper back into the python world.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98264
Approved by: https://github.com/ngimel
It's already not working, but this makes the error message a bit more readable. I.e. it turns:
```
% python -c "import torch;x=torch.eye(3).to_sparse().expand(3,3)"
```
from
```
NotImplementedError: Could not run 'aten::as_strided' with arguments from the 'SparseCPU' backend.
```
to
```
RuntimeError: Expand is unsupported for Sparse tensors.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98365
Approved by: https://github.com/pearu, https://github.com/cpuhrsch
Summary:
This PR added a quantizer API to prepare_pt2e_quantizer, which enables user to annotate the nodes in the graph
directly to configure quantization, instead of relying on QConfigMapping, please see test cases in
test_quantize_pt2e.py for examples. Also added a prototype for QNNPackQuantizer, that will be modified later
to fully support different quantization capabilities of QNNPack/XNNPack
The goal of introducing a quantizer is to add flexibility to the quantization API, allowing modeling users and backend developers to express their quantization intentions programmatically, which will free the architecture optimization team from supporting different use cases in the core API in the future. As a concrete example, we used to have https://pytorch.org/docs/master/generated/torch.ao.quantization.qconfig_mapping.QConfigMapping.html#torch.ao.quantization.qconfig_mapping.QConfigMapping as the API for users to express their intent for quantization in fx graph mode quantization, and it has some fancy options like `set_module_name_regex` and `set_module_name_object_type_order`. These are not needed for all backends and add a maintenance burden for the AO team. With the quantizer API we will move such options to a backend-specific `Quantizer` that needs the feature, and all backends, or even advanced modeling users, can implement their own quantizer to express their quantization intent by annotating the nodes. For example, to express the intention of quantizing a convolution node, a user will find the convolution node in the graph and do:
```
operator_spec = qnnpack_quantizer.get_default_per_channel_symmetric_qnnpack_operator_spec()
conv_node.meta["target_dtype_info"] = {
    "input_act_obs_or_fq_ctr": _get_act_obs_or_fq_ctr(operator_spec),
    "weight_obs_or_fq_ctr": _get_weight_obs_or_fq_ctr(operator_spec),
    "bias_obs_or_fq_ctr": _get_bias_obs_or_fq_ctr(operator_spec),
    "output_act_obs_or_fq_ctr": _get_act_obs_or_fq_ctr(operator_spec),
    # TODO: validate that weight_index is set if weight_obs_or_fq_ctr is set
    "weight_index": 1,
    # TODO: validate that bias_index is set if bias_obs_or_fq_ctr is set
    "bias_index": 2,
}
```
Each backend will introduce its own quantizer, e.g. QNNPackQuantizer, which may expose more convenient APIs for modeling users to configure the annotation, and different quantizers can compose with each other to annotate the graph correctly for quantization.
Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_simple_quantizer
python test/test_quantization.py TestQuantizePT2E.test_qnnpack_quantizer_conv
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97994
Approved by: https://github.com/vkuzo
This PR enables the following unit tests in FSDP feature on ROCm.
```
test_bf16_hook_has_wrapping_False_sharding_strategy_ShardingStrategy_FULL_SHARD
test_bf16_hook_has_wrapping_False_sharding_strategy_ShardingStrategy_NO_SHARD
test_bf16_hook_has_wrapping_False_sharding_strategy_ShardingStrategy_SHARD_GRAD_OP
test_bf16_hook_has_wrapping_True_sharding_strategy_ShardingStrategy_FULL_SHARD
test_bf16_hook_has_wrapping_True_sharding_strategy_ShardingStrategy_NO_SHARD
test_bf16_hook_has_wrapping_True_sharding_strategy_ShardingStrategy_SHARD_GRAD_OP
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97517
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily, https://github.com/jithunnair-amd, https://github.com/malfet
This PR exports specific function symbols into the .dll shared library on the Windows platform to support the Windows build of [Intel Extension for PyTorch](https://github.com/intel/intel-extension-for-pytorch).
TORCH_API/TORCH_PYTHON_API/PYBIND11_EXPORT are macros that decorate a function as dllexport during compilation, so that the function symbol is exported into the .dll shared library file on the Windows platform. This is necessary for other libraries (such as IPEX) to import and call these functions through dynamic linking of PyTorch on Windows.
The code changes of this PR add decorators to export specific functions used by IPEX.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98054
Approved by: https://github.com/ezyang
This makes only a cosmetic change to the generated code, but means
triton's broadcasting logic doesn't leak out into the CSE class.
Before:
```python
tmp5_load = tl.load(in_ptr1 + (0))
tmp5 = tl.broadcast_to(tmp5_load, [XBLOCK])
```
After:
```python
tmp5 = tl.load(in_ptr1 + (0))
tmp6 = tl.broadcast_to(tmp5, [XBLOCK])
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98304
Approved by: https://github.com/ngimel
Currently the `TritonKernel.mask_loads` context manager calls
`swap_buffers` which creates a new CSE context. So, code generated in
different mask contexts cannot be CSE'd even if their masks are the
same. This fixes the issue by not calling `swap_buffers` and instead
having `load` manually check if a `"tmp"` name appears in the mask
meaning the load needs to be generated in the compute buffer.
Currently, simple programs involving padding will result in duplicate
masked loads, e.g. the generated code for
```python
def forward():
    a = torch.nn.functional.pad(x, (0, 1))
    return a + a
```
contains the lines
```python
tmp3 = tl.load(in_ptr0 + (x1 + tl.zeros([XBLOCK], tl.int32)), tmp2 & xmask, other=0)
tmp4 = tl.where(tmp2, tmp3, 0.0)
tmp5 = tl.load(in_ptr0 + (x1 + tl.zeros([XBLOCK], tl.int32)), tmp2 & xmask, other=0)
tmp6 = tl.where(tmp2, tmp5, 0.0)
```
With this change, the duplicates are removed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98303
Approved by: https://github.com/ngimel
Fixes #96975
Changes:
- Make sure a custom ShardingDataPipe with `apply_sharding` can be used by `DataLoader` (see the sketch after this list)
- Allow the `apply_sharding` function without the last argument of `sharding_group`
- Make `DataLoader` not rely on `sharding_group`
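A hedged sketch of such a custom datapipe exposing the two-argument `apply_sharding` described above (the class and its sharding logic are illustrative, not the tested implementation):
```python
from torch.utils.data import IterDataPipe

class MyShardedPipe(IterDataPipe):
    def __init__(self, source):
        self.source = source
        self.num_shards, self.shard_idx = 1, 0

    # Note: no trailing `sharding_group` parameter.
    def apply_sharding(self, num_of_instances, instance_id):
        self.num_shards, self.shard_idx = num_of_instances, instance_id

    def __iter__(self):
        for i, item in enumerate(self.source):
            if i % self.num_shards == self.shard_idx:
                yield item
```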
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97287
Approved by: https://github.com/NivekT
This replaces fake_mode_from_tensors but it preferentially looks for
fake_mode in TracingContext and also if there is an active fake mode
on the dispatch stack, before groveling in tensors to find it.
This advances PegasusForCausalLM, which was previously failing because
we generated a graph that had a parameter (non-fake) and a SymInt,
and thus previously we failed to detect the correct fake mode.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98321
Approved by: https://github.com/voznesenskym
…eMeta
This modularizes ExtraMeta to bring down its creation cost when it is needed for functions other than sym shape handling.
Change-Id: Ife59b201b0c4fd75090fe8be5171a6dd73a10d10
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98399
Approved by: https://github.com/ezyang
Executorch currently uses the functorch.functionalize API; as a result, we have to invoke make_fx twice (once for filtering out autograd-related stuff, which happens in torchdynamo.export(aten=True), and once for tracing the functionalized version of the graph). The previous PR changes the make_fx behaviour to pass in the fake tensors used in dynamo. But as Executorch invokes the second make_fx directly, we need access to the fake tensors that dynamo used. We cannot call torchdynamo.export again in the second round because we don't have a way to functionalize inside dynamo at the moment. Hence I added this attribute in dynamo for now. Once we move to AOTAutograd functionalization, we won't have to deal with this anymore and I will remove it.
Differential Revision: [D43994692](https://our.internmc.facebook.com/intern/diff/D43994692)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96561
Approved by: https://github.com/zhxchen17, https://github.com/voznesenskym
Summary: retry of landing D44550100, try to import triton otherwise consider version as `None`
Test Plan: will make sure windows OSS tests run as well in CI
Differential Revision: D44694213
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98369
Approved by: https://github.com/huydhn
In the terminal state, it won't matter if you have dynamic_shapes
on or not, mark_dynamic will always work.
Today, it's helpful to make this not error so I can easily swap
between static or not and run experiments.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98324
Approved by: https://github.com/voznesenskym
## BC-breaking note:
This is technically a bugfix. Prior to this PR, for `torch.nn.functional.grid_sample(mode='nearest')` the 2D kernel used `std::nearbyint` whereas the 3D kernel used `std::round` in order to determine the nearest pixel locations after un-normalization of the grid. This PR fixes the 3D kernel to use `std::nearbyint`, which rounds values that are exactly `<>.5` to the nearest even, consistent with the behavior of `torch.round`. Unnormalized indices that are exactly `<>.5` will now be rounded to the nearest even instead of being rounded away from zero.
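For reference, the round-half-to-even behavior that both kernels now follow matches `torch.round`:
```python
import torch

t = torch.tensor([0.5, 1.5, 2.5, -0.5, -1.5])
print(torch.round(t))  # tensor([ 0.,  2.,  2., -0., -2.])  -- halves go to the nearest even
```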
## Description
In the nearest neighbor interpolation mode, the 2D GridSample rounds the index to the nearest even using [std::nearbyint](https://github.com/pytorch/pytorch/blob/v2.0.0/aten/src/ATen/native/cpu/zmath.h#L182) whereas the 3D GridSample rounds the index away from zero using std::round. This discrepancy needs to be resolved. We are making both 2D GridSample and 3D GridSample round to the nearest even.
## Unit Test Goals
1. Make sure the x dimension and y dimension rounding behaviors are the same for 2D GridSample.
2. ~~Make sure the 2D GridSample rounding mode is rounding to the nearest even.~~
3. Make sure the x dimension, y dimension, and z dimension rounding behaviors are the same for 3D GridSample.
4. ~~Make sure the 3D GridSample rounding mode is rounding to the nearest even.~~
5. The 2D GridSample and 3D GridSample rounding behaviors are exactly the same.
After some experiments, I found 2 and 4 are difficult to achieve. Even though I can compute the normalized coordinates corresponding to the unnormalized coordinates including [0, 0.5, 1.0, 1.5, 2.0, 2.5, ..., 10.0], the unnormalization process in the GridSample implementations always have a small chance of having floating point error. Therefore, it's not possible to unit test the rounding mode from the normalized coordinates.
## Unit Test Methods
The unit test is simple: use the same values along the dimension to be tested in the input tensor and the same normalized indices in the grid tensor. The interpolation along the 2D GridSample x-dimension, 2D GridSample y-dimension, 3D GridSample x-dimension, 3D GridSample y-dimension, and 3D GridSample z-dimension should then produce exactly the same interpolated values.
If one CPU/CUDA 2D/3D implementation uses a different rounding mode from the others, the unit test will fail.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97000
Approved by: https://github.com/mikaylagawarecki
Summary:
This PR implements `BaseSparsifier.convert()`, which performs module swapping.
The modules and mappings will be merged in a future PR.
Test Plan:
`python test/test_ao_sparsity.py -- TestBaseSparsifier.test_convert`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97545
Approved by: https://github.com/jerryzh168
We used to keep track of the average of stats; however, this makes things difficult when we munge the data to find interesting insights (i.e. finding the total test time for an oncall). The pin is updated so that we keep track of the sum instead, as well as an "occurrences" field, so that the average can be re-derived from sum/occurrences.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98359
Approved by: https://github.com/huydhn
Move the responsibility of flattening the input arguments from the graph module to the caller. This serves two purposes:
- Transformations that add/remove state need to manipulate a state container that maintains the state tensors in the same order as they appear in graph placeholders.
- Reduced runtime cost. The state container is only flattened once upfront.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98392
Approved by: https://github.com/mrshenli
Currently, the compile API assumes all input tensors' shard dimension is the first dimension. dtensor expansion doesn't work when there are input tensors whose shard dimension is not the first dimension.
In addition, respect non-tensor inputs beyond nn.Module and optim.Optimizers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98391
Approved by: https://github.com/mrshenli
According to profiling, the top two expensive operations in spmd expansion are propagate_op_sharding and make_fx (for every dispatcher op node). This PR makes the following changes to speed up spmd expansion:
- We are unnecessarily doing propagate_op_sharding twice for every op. Remove one.
- When no tensor redistribution is required, we only need to update non-tensor args of the node according to op_schema and avoid building a GraphModule just for the node.
On a DDP use cases + foreach Adam, this change speeds up spmd expansion by ~5x (~10 min -> ~2 min).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98389
Approved by: https://github.com/mrshenli
Because we do not persist the output memory of cudagraphs, we need to reconstruct tensors at their correct memory locations after we've done a run. We were using a storage cache for that, but it had a couple of issues:
- If a data ptr existed in the cache, we should only reuse the corresponding storage if that storage hadn't died.
- It didn't work across separate nodes. While you wouldn't think this would be an issue, it was in testing HF.
- StorageWeakRef tracks whether the Storage C++ object remains allocated, not whether the corresponding memory has been deallocated. In order to use them to track memory deallocations we must maintain a single StorageWeakRef for all Storages that reference that memory (even if we are constructing Storages that do not have a deallocator function).
This PR uses a single storage cache as we execute any tree path. When we retrieve a storage from the cache we
check that it is still alive, and we hash based on both the observed recording data ptr and the StorageImpl weak ref.
Update to use a single storage cache across all executions in a path.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98254
Approved by: https://github.com/jansel
The method torch.UntypedStorage.new is not detailed in the API docs. Adding a method identifier may make it easier to know that the new() method is only implemented in C++, like copy_() or nbytes().
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98201
Approved by: https://github.com/ezyang
PyTorch slow tests are run in CI with `PYTORCH_TEST_SKIP_FAST=1` which skips any
test not decorated with `@slowTest`. That means tests marked with
`skipIf(not TEST_WITH_SLOW)` are never run.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97841
Approved by: https://github.com/jansel
1. Fixed dynamic shapes support in cpp_wrapper
- fixed the cpp codegen of `size()` and `stride()`
- fixed the cpp codegen of `ShapeAsConstantBuffer`
- changed to use `cexpr` instead of `pexpr` in the cpp codegen of the `sizevar`
2. Enabled dynamic shapes tests for cpp_wrapper
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97965
Approved by: https://github.com/jgong5, https://github.com/jansel
Previously, when we ran a forward graph whose backward we never invoked, it would prevent us from switching from warmup to recording. Now, we refine the heuristic to allow incrementing the generation as soon as we invoke a backward graph. This still handles the
```
mod1 = torch.compile(...)
mod2 = torch.compile(...)
mod2(mod1(x)).sum().backward()
```
case while accounting for graphs which we may not run backward of.
It also now handles the case where we skip cudagraphify the backward of a forward.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98112
Approved by: https://github.com/jansel
This PR fixes https://github.com/pytorch/pytorch/issues/96203.
**Details**
When using `nn.SyncBatchNorm` with the model converted to FP16, there is a dtype discrepancy in the `SyncBatchNorm.forward()` causing an error like:
```
File "/.../pytorch/torch/nn/modules/_functions.py", line 91, in forward
mean, invstd = torch.batch_norm_gather_stats_with_counts(
RuntimeError: Expected counts to have type Half but got Float
```
[`torch.batch_norm_gather_stats_with_counts()`](fe9da29842/torch/nn/modules/_functions.py (L88-L97)) requires the `running_mean`, `running_var`, and `counts` to have the same dtype. However, when the model has been converted to FP16, only `running_mean` and `running_var` use FP16, while the `counts` are in FP32 due to [`mean` being in FP32](fe9da29842/torch/nn/modules/_functions.py (L25-L30)). This PR resolves this by casting `counts` from FP32 to FP16 instead of the alternative to cast `mean` and `invstd` from FP32 to FP16.
Moreover, for the backward, this PR casts `weight` from FP16 to FP32 to match the dtype of `mean` and `invstd` as required by `torch.batch_norm_backward_elemt()` instead of the alternative to cast `mean` and `invstd` from FP32 to FP16.
**Test Plan**
I dug up this run command from 2021:
For `world_size` in `{1,2}` and `backend` in `{nccl, gloo}`:
```
WORLD_SIZE=world_size BACKEND=backend python -m pytest test/distributed/test_distributed_spawn.py -k test_DistributedDataParallel_SyncBatchNorm_half -vs
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98332
Approved by: https://github.com/rohan-varma
BackendMeta offers a binary interface for the backend to attach arbitrary data to TensorImpl. TensorImpl has exactly one "slot" for backend metadata; however, the backend is free to compose any structure that is opaque to the framework beyond inheriting the standard BackendMeta base.
Change-Id: I670fcdd16dd1c2b00f7eaa1cbc5b5dfea59a6221
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97429
Approved by: https://github.com/ezyang
Summary: Skip mobilenet_v3_large for accuracy checking to reduce
noise on the dashboard. The root cause still needs to be investigated.
mobilenet_v3_large shows random accuracy check failures with different
error values from time to time, and here are some examples:
```
cuda train mobilenet_v3_large [2023-04-04 14:54:50,990] torch._dynamo.utils: [ERROR] RMSE (res-fp64): 0.02172, (ref-fp64): 0.01068 and shape=torch.Size([960, 1, 5, 5])
[2023-04-04 14:54:50,990] torch._dynamo.utils: [ERROR] Accuracy failed for key name features.14.block.1.0.weight.grad
```
```
cuda train mobilenet_v3_large [2023-04-04 14:57:59,972] torch._dynamo.utils: [ERROR] RMSE (res-fp64): 0.07744, (ref-fp64): 0.03073 and shape=torch.Size([72, 1, 5, 5])
[2023-04-04 14:57:59,973] torch._dynamo.utils: [ERROR] Accuracy failed for key name features.4.block.1.0.weight.grad
```
One observation is that turning off cudnn in eager mode with
`torch.backends.cudnn.enabled = False` makes the non-deterministic
behavior go away, but then accuracy checking fails consistently.
Minifier didn't help to narrow down the error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98314
Approved by: https://github.com/huydhn
Summary:
In a highly multi-threaded environment, setting the number of threads to match hardware_concurrency leads to high contention. The x86 path actually ends up taking a different path (the MKL path), which results in using 1 thread for x86 as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98267
Approved by: https://github.com/malfet
Summary:
* change caching to have `system` and `cache` components, where `system` serves as an identifier for that machine's performance, similar to the original method of having GPU type and CUDA version be cache keys, and now also includes the Triton version. `cache` is similar to the original cache type, but now without the GPU name or CUDA version
```
{
"system": {
"device": "NVIDIA PG509-210",
"version": {
"cuda": "11.4.0",
"triton": "2.1.0"
},
"hash": "e7cfb8786d2e1366b3df564bcb2f957d07545e98bf20c98d33a43b6ee80a91e0"
},
"cache": {
"bias_addmm": {
"[('cuda', 'torch.float32', 2048, 160, 0, 1, 0), ('cuda', 'torch.float32', 2048, 1140, 228148, 1, 206080), ('cuda', 'torch.float32', 1140, 160, 1, 1140, 0)]": {
"bias_addmm-alpha=1-beta=1-c73frtshmeth2spjun3zc4l2q7ck43wl356pnlmsmxgmzbfsz7ef": 0.03654399886727333,
"addmm-alpha=1-beta=1-c4xxd3iocu4yt6z4udrlqnumays7q6mfnfd3qprh4fxgsvyhqdkf": 0.03564799949526787,
"triton_mm-ACC_TYPE='tl.float32'-ALLOW_TF32=True-BLOCK_K=32-BLOCK_M=64-BLOCK_N=64-EVEN_K=False-GROUP_M=8-num_stages=2-num_warps=4-cxgwpjkimm4azwffrfuqniwncnv4h5bxrpo4od4an4bstnh7qrqh": 0.04927999898791313,
"triton_mm-ACC_TYPE='tl.float32'-ALLOW_TF32=True-BLOCK_K=32-BLOCK_M=64-BLOCK_N=128-EVEN_K=False-GROUP_M=8-num_stages=3-num_warps=4-cqlirysniekkuuvc4ue33dr4gpfzsb5e4bexarrsnsyei4slxvcz": 0.03651199862360954,
"triton_mm-ACC_TYPE='tl.float32'-ALLOW_TF32=True-BLOCK_K=32-BLOCK_M=128-BLOCK_N=64-EVEN_K=False-GROUP_M=8-num_stages=3-num_warps=4-cww5uss3k4d3ei2c4lx63pudyzxdwl3ieibhxcrue4zg424eqrnu": 0.03580800071358681,
"triton_mm-ACC_TYPE='tl.float32'-ALLOW_TF32=True-BLOCK_K=32-BLOCK_M=64-BLOCK_N=128-EVEN_K=False-GROUP_M=8-num_stages=4-num_warps=8-cqcla5edxdm3n6rrkmjehexsudravx6lpphfo5zazldpo3rzpqc4": 0.03558399900794029,
"triton_mm-ACC_TYPE='tl.float32'-ALLOW_TF32=True-BLOCK_K=32-BLOCK_M=128-BLOCK_N=64-EVEN_K=False-GROUP_M=8-num_stages=4-num_warps=8-c7gdf2snt4bjlnuzdy3px4pyq3lbsdh4jp6jaie7lq6mdxccy6nl": 0.03455999866127968,
"triton_mm-ACC_TYPE='tl.float32'-ALLOW_TF32=True-BLOCK_K=32-BLOCK_M=64-BLOCK_N=32-EVEN_K=False-GROUP_M=8-num_stages=5-num_warps=8-cjhcy4scxgy4lxbhjiinvxl3bbrqya63jilcckx2ltsg3mpzxyqr": 0.036288000643253326,
"triton_mm-ACC_TYPE='tl.float32'-ALLOW_TF32=True-BLOCK_K=32-BLOCK_M=32-BLOCK_N=64-EVEN_K=False-GROUP_M=8-num_stages=5-num_warps=8-cu32a5vsbaln3t55jm2y6xhwgyggejmoatyakcm2huvxofw2zzva": 0.0398080013692379,
"triton_mm-ACC_TYPE='tl.float32'-ALLOW_TF32=True-BLOCK_K=32-BLOCK_M=128-BLOCK_N=128-EVEN_K=False-GROUP_M=8-num_stages=2-num_warps=8-croberh4l55jxlrlgkttigtebsnmosycc5rdtbtn3lp3bpovgz4a": 0.0732479989528656,
"triton_mm-ACC_TYPE='tl.float32'-ALLOW_TF32=True-BLOCK_K=64-BLOCK_M=64-BLOCK_N=64-EVEN_K=False-GROUP_M=8-num_stages=3-num_warps=8-c6oxgunysrqpiwwoinylb3sb2hzvx66yhehma64drqvmz52h3r5t": 0.0306560005992651,
"triton_mm-ACC_TYPE='tl.float32'-ALLOW_TF32=True-BLOCK_K=128-BLOCK_M=32-BLOCK_N=32-EVEN_K=False-GROUP_M=8-num_stages=2-num_warps=4-cdrev5e3zno6z6flmhlbxgd26gkdpurljyhrw3ovx6pftoe62dpf": 0.04800000041723251,
"triton_mm-ACC_TYPE='tl.float32'-ALLOW_TF32=True-BLOCK_K=16-BLOCK_M=64-BLOCK_N=64-EVEN_K=False-GROUP_M=8-num_stages=2-num_warps=4-ce3ofrgngrwuo45hw5wqlzztium7gfkf4n5x25gwu4d6ygkea4bs": 0.0751039981842041,
"triton_mm-ACC_TYPE='tl.float32'-ALLOW_TF32=True-BLOCK_K=16-BLOCK_M=32-BLOCK_N=32-EVEN_K=False-GROUP_M=8-num_stages=1-num_warps=2-cfkz2smezre4x7hyhc2kbeawhqup6qpwzgiavrai2ghe5ghouvn4": 0.07401599735021591
},
...,
},
...,
}
}
```
Test Plan:
MAST no global: sw-966772723-OfflineTraining_df2509b8
MAST global: sw-966766969-OfflineTraining_19df7c20
Differential Revision: D44550100
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98010
Approved by: https://github.com/jansel
This is the first phase of the new ONNX exporter API for exporting from TorchDynamo and FX, and represents the beginning of a new era for exporting ONNX from PyTorch.
The API here is a starting point upon which we will layer more capability and expressiveness in subsequent phases. This first phase introduces the following into `torch.onnx`:
```python
dynamo_export(
model: torch.nn.Module,
/,
*model_args,
export_options: Optional[ExportOptions] = None,
**model_kwargs,
) -> ExportOutput:
...
class ExportOptions:
opset_version: Optional[int] = None
dynamic_shapes: Optional[bool] = None
logger: Optional[logging.Logger] = None
class ExportOutputSerializer(Protocol):
def serialize(
self,
export_output: ExportOutput,
destination: io.BufferedIOBase,
) -> None:
...
class ExportOutput:
model_proto: onnx.ModelProto
def save(
self,
destination: Union[str, io.BufferedIOBase],
*,
serializer: Optional[ExportOutputSerializer] = None,
) -> None:
...
```
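A hedged usage sketch of the API above; the toy model and output file name are made up for illustration:
```python
import torch

class MLP(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(4, 2)

    def forward(self, x):
        return self.fc(x).relu()

# Export via the new API and save the resulting ONNX model.
export_output = torch.onnx.dynamo_export(MLP(), torch.randn(1, 4))
export_output.save("mlp.onnx")
```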
In addition to the API in the first commit on this PR, we have a few experiments for exporting Dynamo and FX to ONNX that this PR rationalizes through the new Exporter API and adjusts tests to use the new API.
- A base `FXGraphModuleExporter` exporter from which all derive:
- `DynamoExportExporter`: uses dynamo.export to acquire FX graph
- `DynamoOptimizeExporter`: uses dynamo.optimize to acquire FX graph
- `FXSymbolicTraceExporter`: uses FX symbolic tracing
The `dynamo_export` API currently uses `DynamoOptimizeExporter`.
### Next Steps (subsequent PRs):
* Combine `DynamoExportExporter` and `DynamoOptimizeExporter` into a single `DynamoExporter`.
* Make it easy to test `FXSymbolicTraceExporter` through the same API; eventually `FXSymbolicTraceExporter` goes away entirely when the Dynamo approach works for large models. We want to keep `FXSymbolicTraceExporter` around for now for experimenting and internal use.
* Parameterize (on `ExportOptions`) and consolidate Dynamo exporter tests.
- This PR intentionally leaves the existing tests unchanged as much as possible except for the necessary plumbing.
* Subsequent API phases:
- Diagnostics
- Registry, dispatcher, and Custom Ops
- Passes
- Dynamic shapes
Fixes #94774
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97920
Approved by: https://github.com/justinchuby, https://github.com/titaiwangms, https://github.com/thiagocrepaldi, https://github.com/shubhambhokare1
This cleanup some redundant CI jobs that I found:
* @malfet @ZainRizvi Do we need the debug build in periodic for both 11.8 and 11.7? This is rarely needed AFAIK. I tried removing 11.8 here while keeping 11.7 to be consistent with the rest of the CI. Or maybe it should be the other way around and we keep 11.8
* Remove libtorch 11.7 and 11.8 builds in periodic as it has already been done in [trunk](https://github.com/pytorch/pytorch/blob/master/.github/workflows/trunk.yml#L86-L97)
* Cleanup TSAN (I added this a while back, but there is no drive to go into that further, so let's just kill it) - If you want to keep it, please raise your hand.
### <samp>🤖 Generated by Copilot at 4b3ec53</samp>
This pull request simplifies and consolidates the scripts and workflows for the thread sanitizer (TSAN) build and test configuration. It removes redundant and outdated logic, files, and workflows that were previously used to handle the TSAN build differently from the regular build. It enables all the tests for the TSAN build, which has been fixed by another pull request.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98044
Approved by: https://github.com/malfet, https://github.com/ZainRizvi
Summary:
This test tests an operator that quantizes and serializes a float array.
Among the data serialized, one element is the bias, i.e. the minimum value in the array.
The test may fail when the array contains both +0.0 and -0.0, while all other elements are positive.
(this happens quite frequently with a hypothesis version >= 6.17.4, due to [this issue](https://github.com/HypothesisWorks/hypothesis/issues/3606))
Depending on the exact settings of SIMD (single instruction, multiple data), the elements of the array may be visited in different orders while running the operator and while calculating the reference.
Because +0.0 and -0.0 compare equal, the minimum value may be either +0.0 or -0.0.
Nevertheless, the serialized forms of these two values differ in the sign bit, and can make the test fail because it's conducting an exact match on the serialized result.
To avoid this failure, I'm adding a line to replace all -0.0 with +0.0 in the input array.
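A small illustration of why an exact-match comparison is fragile for signed zeros, and one way to do the normalization the added line performs (this snippet is illustrative, not the test's code):
```python
import struct

# +0.0 and -0.0 compare equal, but their serialized bit patterns differ in the sign bit.
print(0.0 == -0.0)                                        # True
print(struct.pack("<f", 0.0) == struct.pack("<f", -0.0))  # False

# Adding +0.0 maps -0.0 to +0.0 under IEEE-754 round-to-nearest, normalizing the input.
xs = [-0.0, 0.25, 1.5]
xs = [x + 0.0 for x in xs]
print(struct.pack("<f", xs[0]) == struct.pack("<f", 0.0))  # True
```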
Test Plan:
Run this with both hypothesis < 6.17.4 and >= 6.17.4:
```
buck2 test mode/opt caffe2/caffe2/python:fused_8bit_rowwise_conversion_ops_test - test_quantize_op
```
Differential Revision: D44617022
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98183
Approved by: https://github.com/malfet
This PR ensures that when prefetching a `FlatParamHandle.unshard()`, we temporarily set the `FlatParamHandle._training_state` to the expected training state as if the `unshard()` were not prefetched since the `as_params` argument to `_use_unsharded_views()` depends on the handle's training state.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98249
Approved by: https://github.com/rohan-varma
Avoid referring to std::vector<T> members and constructors/destructors when T is incomplete.
Referring to incomplete members is [not legal](https://timsong-cpp.github.io/cppwp/n4868/vector#overview-4) according to the C++ standard.
Non-noexcept constructors need access to members' destructors. As of C++20, std::vector's destructor is constexpr and so forcefully requires a complete type for the vector's elements.
These issues cause build errors in newer toolchains under c++20 mode.
Fix them by moving code that needs complete types to a different place where the type is already defined.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93978
Approved by: https://github.com/Skylion007
Fix Meta internal use case:
* We are going to skip tracing ```torchrec.distributed```; however, in fbcode, the structure is a bit different from OSS torchrec.
* Meta internally uses ```torch.package```, so we should support skip tracing files like ```<torch_package_0>.torchrec/distributed/...```.
* We put the logic behind a flag ```is_fbcode``` to avoid misuse.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98192
Approved by: https://github.com/yf225
This reverts commit bc38b278bf4c2890700f8fe751cfd15fcb01da60.
Reverted https://github.com/pytorch/pytorch/pull/97429 on behalf of https://github.com/huydhn due to Sorry for reverting your PR as I am trying to root cause a libtorch build failure on Windows starting from your change bc38b278bf. AFAICT, there is no other change from the log. I will reland this if the failure is unrelated
Summary:
The goal is to remove the need to use backend_config when pt2e flow code call this function
Test Plan:
python test/test_quantization.py TestQuantizeFx
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98094
Approved by: https://github.com/jcaip
BackendMeta offers a binary interface for the backend to attach arbitrary data to TensorImpl. TensorImpl has exactly one "slot" for backend metadata; however, the backend is free to compose any structure that is opaque to the framework beyond inheriting the standard BackendMeta base.
Change-Id: I670fcdd16dd1c2b00f7eaa1cbc5b5dfea59a6221
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97429
Approved by: https://github.com/ezyang
### <samp>🤖 Generated by Copilot at 79f1b37</samp>
This pull request improves the workflow and data processing for uploading contribution and testing statistics to Rockset and S3. It renames and updates a workflow file, removes unused code from a script, and adds a new script to aggregate and upload test results.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97954
Approved by: https://github.com/huydhn
Inductor codegen is suboptimal when calling all_reduce_coalesced with input args. We need to fix inductor's calling convention for that, or something else.
Might not work if any output is unused.
Test code:
```python
import torch
import torch.distributed as dist
import torch.nn.functional as F
from functorch import make_fx
import os
import torch.distributed._functional_collectives as ft_c
from torch.testing._internal.common_distributed import (
spawn_threads_and_init_comms,
)
from torch._inductor.compile_fx import compile_fx_inner
def my_fun(a, b):
    c = a * 3
    tensors = ft_c.all_reduce_coalesced([a, c, b], "sum", [0])
    return ((tensors[1] + tensors[0] + tensors[2]).sum(), )

@spawn_threads_and_init_comms(world_size=1)
def inductor_main(self):
    x = torch.arange(4).cuda() * (dist.get_rank() + 1)
    y = torch.arange(4).cuda() * (dist.get_rank() + 1)
    x = x.to(torch.float)
    y = y.to(torch.float) * 0.5
    res = make_fx(my_fun)(x, y)
    print(f"fx graph:\n{res.graph}")
    ind = compile_fx_inner(res, [x, y])
    print(f"inductor done:\n{ind}")

os.environ["PROXY_TENSOR_TRACING"] = "1"
os.environ["TORCH_COMPILE_DEBUG"] = "1"
torch._dynamo.config.output_code = True

if __name__ == "__main__":
    inductor_main(None)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97157
Approved by: https://github.com/fegin
`is_empty()` checks `numel() == 0`, but we don't need to access `numel_` at all (or the policy that `numel()` checks) in our happy path -- we just need the data pointer from `storage_`. Let's do the check we need to do using only the data we strictly need, rather than adding instructions loading other pieces of data.
Differential Revision: [D44586464](https://our.internmc.facebook.com/intern/diff/D44586464/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98090
Approved by: https://github.com/Skylion007
Among the changes is the introduction of gather_dim and scatter_dim in DeviceMesh collectives to simplify user code.
The current plan is to keep padding and gather/scatter dim support in DeviceMesh while we explore optimization opportunities in Inductor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96226
Approved by: https://github.com/wanchaol
Summary:
This is a copy of https://github.com/pytorch/pytorch/pull/97152 to make
the landing easier.
This PR implements a two-pass wrapper codegen for the Triton
backend to achieve ahead-of-time compilation. In the first pass, the
regular python wrapper code will be generated, and then the generated
code will be executed to perform Triton compilation and autotuning.
After that, the second pass wrapper codegen will generate C++ wrapper
with proper CUDA API to load and launch Triton-generated CUDA kernels.
Like the AOT mode for the cpp backend, the next step would be to provide
a more complete API for AOT.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98214
Approved by: https://github.com/eellison
Summary:
`:test_dynamo` has been broken for a long time internally at Meta. This PR fixes the broken test and re-enables it internally.
- Using the root `pytest.ini` for pytest
- Decouple tests so that one can be disabled without affecting others
- Temporarily disable the test cases that require additional efforts to fix
**OSS CI doesn't provide test code coverage info; Meta's internal test infra does. The value of re-enabling these tests internally is not only to collect test coverage info but also to help fbcode developers build/test from fbcode.**
Test Plan:
`buck test mode/dev-nosan //caffe2/test:test_dynamo`
https://www.internalfb.com/intern/testinfra/testrun/7318349540623516
Differential Revision: D44325238
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97937
Approved by: https://github.com/ezyang
Uses the existing deterministic implementation via `index_put`, which is deterministic because it is based on sorting indices.
With the `accumulate` arg in `index_put`, this can work for both scatter and scatter_reduce with sum/mean reduction mode.
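A minimal sketch (not the ATen implementation itself) of how a sum-mode scatter can be routed through `index_put_` with `accumulate=True`:
```python
import torch

# index_put_ with accumulate=True has a sort-based deterministic path, so routing
# scatter/scatter_reduce with sum reduction through it yields deterministic results.
src = torch.arange(6, dtype=torch.float)
index = torch.tensor([0, 1, 0, 1, 2, 2])
out = torch.zeros(3)
out.index_put_((index,), src, accumulate=True)

ref = torch.zeros(3).scatter_add_(0, index, src)
assert torch.equal(out, ref)  # tensor([2., 4., 9.])
```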
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98060
Approved by: https://github.com/mikaylagawarecki
Summary:
Supporting Per Channel quantization in the gradient computation function.
One workaround that I have added here:
Current QNNPACK is not designed to process [transposed weight](https://fb.workplace.com/groups/pytorch.edge.users/permalink/1283737025829921/).
Here we simply replace Per Channel with Per Tensor quantization to compute the gradient (some slow learning curve or WER degradation might be expected - we don't know, nothing is guaranteed).
Test Plan:
You can create your own synthetic model,
FP32 layer -> INT8 layer with Per Channel and see if loss is decreasing
Differential Revision: D43898794
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97475
Approved by: https://github.com/weiwangmeta
Summary:
This diff extends pattern matcher, by adding a few features which allows it to handle split-getitem-cat style patterns.
3 problems I encountered were:
1. In the handler, I only need one Arg() (the one which is the first input to split). None of the other args are relevant to the replacement graph, so we add a new Ignored() pattern to have ignored args.
2. The pattern matching was visiting the split node again and again during the DFS. By propagating the patterns with _users>1 or Any into the child MatchContext, we avoid this problem.
3. To avoid the unbundling issue, I switched to using KeywordArg() instead of Arg(), since for this pattern we need a flat list of Arg() in the end.
Example pattern: https://www.internalfb.com/intern/anp/view/?id=3325856
```
pass_patterns.append(defaultdict(list))
register_replacement_pattern(
    CallFunction(
        aten.cat,
        ListOf(
            CallFunction(
                operator.getitem,
                CallFunction(aten.split_with_sizes, KeywordArg("input_"), Ignored(), Ignored(), _users=Any),
                Ignored(),
            ),
        ),
        Ignored(),
    ),
    pass_number=3,
)
def split_cat_replace(input_):
    return input_
```
Test Plan: https://www.internalfb.com/intern/anp/view/?kernel=default&id=3317105
Reviewed By: jansel
Differential Revision: D44282499
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97726
Approved by: https://github.com/jansel
This patch is part of half float performance optimization on CPU:
* add a specialization for dtype `Half` in `Vectorized<>` under both avx256 and avx512.
* add a specialization for dtype `Half` in the functional utils, e.g. `vec::map_reduce<>()`, which use float32 as the accumulation type.
Also add a helper struct `vec_hold_type<scalar_t>`, since Vectorized<Half>::value_type points to its underlying storage type, which is `uint16_t`, leading to errors if the kernel uses `Vec::value_type`.
Half uses the same logic as BFloat16 in the Vectorized<>, each half vector is mapped to 2x float vectors for computation.
Notice that this patch modifies the cmake files by adding **-mf16c** to the AVX2 build; from https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html we can see that all hardware platforms that support **avx2** already have **f16c**.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96076
Approved by: https://github.com/malfet
I notice that we are running some slow tests for CPU and `sm86` on pull and trunk. They take much longer to run than other shards (1.5x to 2x longer). I propose that we move them to periodic instead. Thoughts?
The correlations between them are:
* `linux-bionic-cuda11.7-py3.10-gcc7-sm86 / test (slow)` and `linux-bionic-cuda11.7-py3.10-gcc7-sm86 / test (default)` is 0.93
* `linux-bionic-py3.8-clang9-slow / test (slow)` and `linux-bionic-py3.8-clang9 / test (default)` is 0.98
### <samp>🤖 Generated by Copilot at db56750</samp>
This pull request updates the `.github/workflows` files to optimize the testing workflows for PyTorch. It adds new periodic workflows for more platforms and configurations, and removes some redundant or slow workflows from the pull and trunk workflows.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98040
Approved by: https://github.com/malfet
Remove `CppTile2DTailKernel` and `CppTile2DKernelChecker` and reuse `CppVecKernel` and `CppVecKernelChecker` for them. Add vectorization with fallback for load/store in CppVecKernel for the non-contiguous load/store needed by `CppTile2DTailKernel`.
This PR also adds a functional support for transposed copy of bfloat16 data types. Better performance requires vectorized intrinsics implemented for at::vec::transpose_mxn. cc @soumith @voznesenskym @penguinwu @anijain2305 @EikanWang @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @desertfire
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97626
Approved by: https://github.com/jansel
When copying data from pointers, only the lowest bytes are copied. On little endian systems they are located at the beginning of the pointer; on big endian systems they are located at the end of the pointer.
This change fixes TestTensorExprPyBind::test_dynamic_shape and TestTensorExprPyBind::test_dynamic_shape_2d tests from test/test_tensorexpr_pybind.py on big endian systems.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96951
Approved by: https://github.com/ezyang, https://github.com/EikanWang
### Description
This PR is to update ideep submodule for the following two aspects:
1. At the inductor side, we are supporting the dynamic shape path for packed linear, where we hope the packed weight of linear doesn't depend on the input shapes and we can still get better performance using a packed weight obtained from dummy input shapes. However, the current ideep has an accuracy issue for this case; this update fixes the issue.
2. Add an extra arg is_channels_last for deconv to notify ideep whether to go channels last or not, because the memory format checks of ideep (e.g. is_nhwc(), is_ndhwc()) are not 100% identical to suggest_memory_format() from pytorch.
### Performance Benchmark
Use TorchBench test in ICX with 40 cores
Intel OpenMP & tcmalloc were preloaded

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97430
Approved by: https://github.com/jgong5
The following metrics should be helpful:
- percent of time GPU is busy
- percent of time various category of kernels (e.g. pointwise/reduction triton kernel) takes
- percent of time each individual kernel takes compared to total wall time of the benchmark
This PR adds those.
Example result from hf_Bert infernece graph:
```
== triton_pointwise category kernels ==
Kernel Self CUDA TIME (ms) Count Percent
------------------------------ --------------------- ------- ---------
triton_poi_fused_gelu_6_0d1d 0.48154 12.0 5.52%
triton_poi_fused_clone_1_0d1d2 0.29011 24.0 3.33%
triton_poi_fused_clone_2_0d1d2 0.17417 12.0 2.00%
triton_poi_fused_clone_4_0d1d2 0.10797 12.0 1.24%
Total 1.05379 12.08%
== triton_persistent_reduction category kernels ==
Kernel Self CUDA TIME (ms) Count Percent
------------------------------ --------------------- ------- ---------
triton_per_fused__softmax__to_ 0.97188 12.0 11.14%
triton_per_fused_add_native_la 0.37401 24.0 4.29%
triton_per_fused_gelu_native_l 0.02 1.0 0.23%
triton_per_fused_add_embedding 0.01718 1.0 0.20%
Total 1.38307 15.86%
== unknown category kernels ==
Kernel Self CUDA TIME (ms) Count Percent
------------------------------ --------------------- ------- ---------
ampere_fp16_s16816gemm_fp16_12 2.24514 24.0 25.74%
ampere_fp16_s16816gemm_fp16_25 1.39796 49.0 16.03%
void cutlass::Kernel<cutlass_8 1.36093 1.0 15.61%
ampere_fp16_s16816gemm_fp16_64 0.74591 12.0 8.55%
ampere_fp16_s16816gemm_fp16_12 0.61989 12.0 7.11%
Memset (Device) 0.024 12.0 0.28%
void at::native::(anonymous na 0.01543 2.03 0.18%
void at::native::vectorized_el 0.00011 0.03 0.00%
Total 6.40937 73.49%
Percent of time when GPU is busy: 101.44%
```
Note: the output shows the total time the GPU is busy is larger than the total wall time. We measure total wall time with profiling disabled while measuring GPU time with profiling enabled, which may distort the measurement a bit. But I assume the effect is not too large, assuming the profiler mostly increases CPU time (rather than GPU).
## interesting usages
1. I pick a model on which cudagraphs improve perf significantly, like densenet121, and run the tool on its forward graph. It's no surprise that the GPU is idle quite a lot of the time:
```
(Forward graph) Percent of time when GPU is busy: 32.69%
Total wall time 17.307 ms
```
Its backward graph has a lower percentage of GPU idle time, but it's still high:
```
(Backward graph) Percent of time when GPU is busy: 46.70%
Total wall time 17.422 ms
```
2. I profile a subset of torchbench models and plot a table to show the percent of execution time for pointwise/reduction/persistent_reduction/unknown_category. Since I plan to explore using the coordinate descent tuner to improve reductions, models with a high percent of time spent on reduction should be good candidates (e.g. resnet50, mobilenet_v2).
NOTE: the same model appears twice. The first row is for the fwd graph and the second for the bwd graph. We profile different graphs for a model separately.
```
benchmark_name pointwise_percent reduction_percent persistent_reduction_percent unknown_category_percent GPU_busy_percent wall_time_ms
----------------------- ------------------- ------------------- ------------------------------ -------------------------- ------------------ --------------
resnet18 19.73% 7.86% 4.81% 41.25% 73.65% 2.549ms
resnet18 18.59% 7.13% 3.35% 67.35% 96.41% 3.467ms
resnet50 29.57% 22.13% 2.07% 51.68% 105.46% 6.834ms
resnet50 26.42% 15.27% 0.94% 59.68% 102.31% 13.346ms
vgg16 26.23% 0.00% 0.00% 74.20% 100.43% 18.212ms
vgg16 15.63% 5.61% 0.10% 79.42% 100.75% 33.485ms
BERT_pytorch 28.62% 4.82% 14.88% 33.32% 81.64% 7.162ms
BERT_pytorch 14.43% 13.41% 18.19% 49.24% 95.27% 10.395ms
densenet121 11.89% 2.14% 3.86% 16.36% 34.25% 16.531ms
densenet121 10.37% 2.06% 4.09% 31.46% 47.98% 16.934ms
hf_Bert 23.94% 0.00% 29.88% 46.09% 99.90% 7.766ms
hf_Bert 11.65% 10.54% 20.26% 61.66% 104.11% 11.892ms
nvidia_deeprecommender 42.92% 0.00% 0.00% 56.75% 99.67% 3.476ms
nvidia_deeprecommender 31.36% 3.44% 0.46% 65.20% 100.45% 3.872ms
alexnet 30.99% 0.00% 0.00% 69.16% 100.14% 3.169ms
alexnet 24.41% 4.83% 0.17% 71.09% 100.50% 4.709ms
mobilenet_v2 29.21% 27.79% 2.49% 44.00% 103.49% 10.160ms
mobilenet_v2 17.50% 15.05% 1.06% 69.68% 103.29% 20.715ms
resnext50_32x4d 18.96% 9.28% 2.31% 28.79% 59.33% 5.899ms
resnext50_32x4d 18.48% 11.01% 1.86% 53.80% 85.14% 7.167ms
mnasnet1_0 19.07% 14.52% 3.01% 35.43% 72.03% 6.028ms
mnasnet1_0 14.17% 12.00% 1.87% 67.56% 95.60% 9.225ms
squeezenet1_1 38.56% 0.00% 1.77% 56.21% 96.53% 2.221ms
squeezenet1_1 21.26% 7.57% 1.05% 67.30% 97.18% 4.942ms
timm_vision_transformer 17.05% 0.00% 18.80% 65.79% 101.64% 9.608ms
timm_vision_transformer 9.31% 9.07% 10.32% 73.25% 101.96% 16.814ms
```
## how to use
`python {compiled_module_wrapper.py} -p`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97723
Approved by: https://github.com/jansel
We have noticed that on BERT_pytorch in torchbenchmark the majority of time is spent running GEMM in aten::addmm. At the moment this calls into a BLAS routine, but on AArch64 it will be faster if it calls into mkldnn_matmul. Performance-wise, compared to a build with OpenBLAS, it runs 1.2x faster on 16 cores with a batch size of 8 on Graviton3, while if fast math mode (mkldnn_matmul exposes, through oneDNN and the Arm Compute Library, an option to run GEMM with FP32 inputs using BF16 operations) is enabled then it is 2.3x faster.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91763
Approved by: https://github.com/jgong5, https://github.com/ngimel, https://github.com/malfet
This was used to unblock Meta internal use cases where ```torchrec.distributed``` was used; however, it can't be traced by dynamo properly right now.
We were sending the same fix(#90087) several months ago, but was reverted due to ```fbgemm``` conflicts. This PR catches ```Exception``` rather than ```ImportError``` which can handle the conflicts.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97824
Approved by: https://github.com/wconstab
Summary:
Rearrange the fields in at::OperandInfo to reduce padding.
The current class layout is {64,3,1,1,8,1,1,1,16,16,8,8}. Moving the 5th
element in the class allows the small bytes/bools to be packed together.
This class is frequently read from places like the stack trace below, so
compacting the class could speed things up.
c10/util/MaybeOwned.h:198 operator*
aten/src/ATen/TensorIterator.h:187 tensor_base
aten/src/ATen/TensorIterator.h:322 tensor_base
aten/src/ATen/TensorIterator.cpp:1194 compute_mem_overlaps
aten/src/ATen/TensorIterator.cpp:1475 build
Test Plan: Rely on unit tests.
Differential Revision: D44559604
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98037
Approved by: https://github.com/swolchok
1. Optimize the function name of AMP in the custom device module: use `torch.foo.set_autocast_enable` instead of `torch.foo.set_autocast_foo_enable`.
2. In AMP with a custom device, use `custom_device_mod.set_autocast_enable` instead of `getattr(custom_device_mod, "set_autocast_enable")`, because we have already checked that `custom_device_mod` has the attribute `set_autocast_enable`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98052
Approved by: https://github.com/bdhirsh
Fixes #95892
This PR fixes the placement error in ChunkShardingSpec when training with multiple nodes. 'rank:{global_rank}/cuda:{local_rank}' should be used, but 'rank:{global_rank}/cuda:{global_rank}' is used instead, which results in a CUDA error: invalid device ordinal.
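A small sketch of the corrected placement string; the rank numbers are made up for illustration:
```python
# e.g. global rank 9 is the second GPU (local rank 1) on the second 8-GPU node
global_rank, local_rank = 9, 1

placement = f"rank:{global_rank}/cuda:{local_rank}"    # correct: local device index
# placement = f"rank:{global_rank}/cuda:{global_rank}" # before: invalid device ordinal
print(placement)  # rank:9/cuda:1
```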
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98063
Approved by: https://github.com/kumpera
Without this change I get the following error.
```
line 444, in unpad_sequence
mask = idx < length
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
```
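A minimal repro sketch of the mismatch (requires a CUDA device; the shapes are arbitrary):
```python
import torch
from torch.nn.utils.rnn import pad_sequence, unpad_sequence

# Padded sequences live on CUDA while the lengths tensor is on CPU; before this change
# the comparison `idx < length` inside unpad_sequence raised the device-mismatch error.
seqs = [torch.randn(3, 2, device="cuda"), torch.randn(5, 2, device="cuda")]
lengths = torch.tensor([3, 5])  # CPU tensor
padded = pad_sequence(seqs)
unpadded = unpad_sequence(padded, lengths)
print([t.shape for t in unpadded])
```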
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98042
Approved by: https://github.com/mikaylagawarecki
Summary:
The goal of this PR is to unify the flow of information to reduce fragmentation of implementations between fx graph mode quantization
and quantize_pt2e. Since quantize_pt2e will be using node.meta to store this information, we'd like to make sure fx graph mode quantization
gets this information from the same place.
Test Plan:
python test/test_quantization.py TestQuantizeFx
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97949
Approved by: https://github.com/andrewor14
Fixes https://github.com/pytorch/pytorch/issues/97260
We got some feedback that the page reads like "in order to save an input
for backward, you must return it as an output of the
autograd.Function.forward".
Doing so actually raises an error (on master and as of 2.1), but results
in an ambiguous situation on 2.0.0. To avoid more users running into
this, we clarify the documentation so it doesn't read like the above
and clearly mentions that you can save things from the inputs or
outputs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98020
Approved by: https://github.com/soulitzer, https://github.com/kshitij12345
This PR moves impl functions to `at::native::mps` to prevent them from being exposed in `at::native`.
Because the moves of functions are hard to review, this PR only refactors part of the functions in the MPS codebase. I will check that everything is correctly moved again before merging.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97238
Approved by: https://github.com/kulinseth
Enable some sensible flake8-simplify rules. Mainly I wanted to enable the SIM101 and `yield from` SIM103 checks. @kit1980 since you wanted to be tagged on this CI check.
Enabling these checks also helped flag one logical bug, so it's definitely beneficial (also fixed in this PR).
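Illustrative before/after sketches of the kinds of patterns these checks flag (the examples are made up, not from this PR):
```python
# Merge repeated isinstance() calls on the same object into one check.
def is_number(x):
    # before: isinstance(x, int) or isinstance(x, float)
    return isinstance(x, (int, float))

# Delegate to an iterable with `yield from` instead of looping and yielding element by element.
def chain(*iterables):
    for it in iterables:
        # before: for item in it: yield item
        yield from it
```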
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97984
Approved by: https://github.com/ezyang
This is a reland of PR #94402 that solves the additional link issues.
PR #94402 failed because caffe2::mkl had been converted to a private dependency while libtorch_cuda_linalg hadn't linked to it explicitly. This is fixed in commit 4373bf0ae3dee32afc178f9d51a4154d6c5904c6
We also replace more references of MKL_LIBRARIES with caffe2::mkl in this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94924
Approved by: https://github.com/malfet
### <samp>🤖 Generated by Copilot at a9fa438</samp>
Simplified a test function for `torch.masked_scatter` in `test/test_torch.py` by removing redundant and unnecessary code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98015
Approved by: https://github.com/ezyang
Separate it for better readability; this helper function can be reused for the deterministic implementation of `scatter` and `scatter_reduce` with sum reduction mode.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97922
Approved by: https://github.com/ngimel
Summary:
This fixes the divide-by-zero that arises when performing a division in which the denominator has a number of channels that isn't a multiple of 4, and therefore the channel dimension has been padded with 0s.
More details in the comments of this post: https://fb.workplace.com/groups/pytorch.edge.users/permalink/1288546972015593/
Test Plan:
```
buck run --target-platforms ovr_config//platform/macos:arm64-fbsource -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64
```
```
buck run --target-platforms ovr_config//platform/macos:arm64-fbsource -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64
```
Differential Revision: D44392406
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97698
Approved by: https://github.com/SS-JIA
It sometimes spits out leftover logs from a previous run on the Windows ephemeral runner, but this might have been fixed by now. I get a bit annoyed when the step runs even though it obviously isn't going to be useful since the test step didn't run.
always() is needed to ensure that it runs on test step failure.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97713
Approved by: https://github.com/huydhn
**Summary**: profiler.record_function inserts an event into the chrome trace generated by the pytorch profiler. This PR adds record_function everywhere that @dynamo_timed is annotated.
dynamo_timed and the CLI viewer torch._dynamo.utils.compile_times() are already useful on their own; but for identifying _when_ these get called, it's nice to be able to view in the profiler chrome trace.
Why not just turn on python stack traces in the profiler to get this information? Dynamo compilation is implemented in python and therefore produces a huge amount of events when it records compilation steps. The resulting trace files are often too large to load in chrome://tracing, and they take a long time to generate. Additionally, the stack traces are deep enough that they are often hard to read. This approach produces much more readable traces with lower overhead.
**Tests**:
- Added in test/dynamo/test_profiler.py. Verified in https://github.com/pytorch/pytorch/actions/runs/4559322864/jobs/8043307798?pr=96495 that the tests are actually running.
- Performance run with `ciflow/inductor-perf-compare` shows no noticeable change in compilation time or speedup numbers. Geomean speedup changes from 1.275 -> 1.277. Geomean compilation times change from 54.2s -> 53.8s. That's likely just due to noise. All individual benchmark numbers regressed by no more than 5% between the two runs; and we see improvements of around the same magnitude, suggesting this is, again, just noise. For meta employees, you can see the results in a google sheets here: https://docs.google.com/spreadsheets/d/1Ki69XvcgxcA3ZnqC5n_jav5KiD4u7Wojlad3VTnIdlk/edit?usp=sharing
**Example**:
Run this:
```python
import torch
def gn(x):
    return x.sin().cos()

def fn(x, y):
    return x.sin() * y.cos()

x, y = [torch.rand((2, 2), device='cuda') for _ in range(2)]

# just to clear out any lazy initialization
with torch.profiler.profile() as prof:
    torch.compile(gn)(x)

with torch.profiler.profile() as prof:
    torch.compile(fn)(x, y)
prof.export_chrome_trace("./dynamo_timed_profile.json")
```
and we can see that the resulting trace shows important dynamo steps, even when python tracing is turned off.
<img width="867" alt="Screenshot 2023-03-29 at 7 26 15 PM" src="https://user-images.githubusercontent.com/5067123/228712263-8ae67ab9-1a52-4765-a9c2-7c5cf0abe2f5.png">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96495
Approved by: https://github.com/ngimel, https://github.com/mlazos
We removed TritonTemplateCaller.to_callable previously, but this method is still used in `TritonTemplateCaller.__str__`. The to_callable method in the base class will be used and raise an exception.
This PR fixes TritonTemplateCaller.__str__ to return the string representation without calling to_callable.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97578
Approved by: https://github.com/nmacchioni, https://github.com/ngimel
Symbolic shapes compile time on full CI with inductor is horribly long (even though our aot_eager local runs seemed to suggest that the added latency was only 10s per model.) To patch over the problem for now, run the benchmark suite with dynamic batch only. This should absolve a lot of sins.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97912
Approved by: https://github.com/janeyx99, https://github.com/desertfire
Repro:
From #92670, this addresses one of the bugs for TorchDynamo:
pytest ./generated/test_PeterouZh_CIPS_3D.py -k test_003
Issue:
In GuardBuilder, when parsing argnames with "getattr(a.layers[slice(2)][0]._abc, '0')", it returns "getattr(a" where it is supposed to return "a", thus causing a SyntaxError.
This PR fixes the regex and adds a couple of test cases.
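An illustrative sketch (not dynamo's actual regex) of the intended parsing behavior, yielding the base name "a" rather than "getattr(a":
```python
import re

source = "getattr(a.layers[slice(2)][0]._abc, '0')"

# Strip a wrapping getattr(...) call, then take the leading identifier as the arg name.
inner = re.sub(r"^getattr\((.*),[^,]+\)$", r"\1", source)
base = re.match(r"[A-Za-z_][A-Za-z0-9_]*", inner).group(0)
print(base)  # "a"
```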
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97810
Approved by: https://github.com/yanboliang
This PR changes the `opt_sizes_` metadata to be computed lazily if needed rather than at construction. Since this metadata is data-dependent, we can't calculate it if we have symbolic metadata (i.e. for dynamic shapes). Notes:
* `opt_size()` is the only public accessor of `opt_sizes_`; several kernels use it. During the first call to this, the metadata is computed.
* `size()` / `sym_size()` use `opt_size()`. For the symbolic case, this will have to change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97895
Approved by: https://github.com/drisspg
Helper function to replace literals that show up in call_function nodes in the graph with placeholders so that they can be represented as wildcards when matching with the SubgraphMatcher. This pass causes the resulting graph to not be runnable with the original inputs, since adding placeholders to the graph changes the number of inputs the graph needs.
Test: `python test/test_fx.py TestMatcher`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97683
Approved by: https://github.com/kimishpatel, https://github.com/SherlockNoMad
My first attempt was to apply the same solution as how proxy_tensor.py
handles other inplace ops. However, foreach is different in that
its schema in `native_functions.yaml` does not return anything,
whereas ops like `addcmul_` and `addcdiv_` do return Tensors (Thanks
bdhirsh for teaching me this!). As a result, the proxy output
during tracing does not wrap anything, and hence we cannot correctly
connect it with subsequent operators. Modifying `native_functions.yaml`
is not a preferred solution. After discussing with bdhirsh, the
temporary solution is to do foreach functionalization as a graph
pass for now. Later, when https://github.com/pytorch/pytorch/issues/97852
is addressed, we will switch to default functionalization.
Edit: the latest version follows @bdhirsh 's suggestion on using
`make_fx` `decomposition_table` instead of implementing manual
fx.Graph tranforms to functionalize `_foreach_add_`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97853
Approved by: https://github.com/fegin, https://github.com/wanchaol
To implement the warning when transitioning reshape to copy-on-write
storage, we want to be able to detect a write to one view family
following by a read or a write to another one that shares the same
copy-on-write storage.
Because we have historically not been strict about the mutability of
our data pointers, any warning we have would likely be far too
aggressive.
Therefore, this is the first PR in a long series to ensure a strict
distinction between mutable and const data accessors in TensorBase,
TensorImpl, Storage, and StorageImpl.
The rough plan is to give the mutable accessor a new name that is
explicit about mutation, this will also force us to rewrite any code
that really needs a mutation.
Differential Revision: [D44409928](https://our.internmc.facebook.com/intern/diff/D44409928/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97647
Approved by: https://github.com/ezyang
Summary:
This PR extends `_fuse_conv_bn_` function to support fusing convtranspose and bn
Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_transposed_conv_bn_fusion
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97933
Approved by: https://github.com/vkuzo
# Motivation
The DLPack device type kDLOneAPI stands for the Unified Shared Memory allocated on a oneAPI device. The corresponding PyTorch backend type is XPU.
Support exporting/importing a PyTorch XPU tensor as a DLPack tensor of the kDLOneAPI device.
# Solution
1. Update the DLPack protocol to v0.7.
2. Add the XPU hooks to map the Aten device and DLPack device with the address value and device information.
# Additional Context
Reopen (#82867)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94968
Approved by: https://github.com/kit1980
Summary:
There was a refactoring a while back to address Kineto <--> PyTorch Profiler buffer management issues. This made the Profiler API path safer, but it regressed the OnDemand path.
The proper long term solution is to merge those paths which would significantly improve the maintainability of the codebase.
Test Plan:
# Test on Resnet integration test
```
buck2 run mode/opt kineto/libkineto/fb/integration_tests:pytorch_resnet_integration_test
dyno gputrace
```
# Trace
https://fburl.com/perfdoctor/t8nkda9z
Differential Revision: D44362040
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97550
Approved by: https://github.com/aaronenyeshi
`emplace` does not overwrite the existing mapped value in a map if it already exists, which can lead to repeated execution of a plan that e.g., tries to allocate an OOM-inducing workspace size and retriggers either a heuristic run (or worse, a benchmark run).
CC @ptrblck @ngimel @Fuzzkatt @syed-ahmed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97838
Approved by: https://github.com/ngimel
Tweaks the TENSOR_MATCH guard logic to avoid saving sizes / strides for the case of dynamic shapes. Instead, the dim() is stored, which is enough for both dense tensors and NTs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97896
Approved by: https://github.com/ezyang
Fixes https://github.com/pytorch/pytorch/issues/96887
We error out in BOTH the case when graph is created and when it is not created.
Still bc-breaking, but not as severe because we are limiting to the case where someone uses setup_context.
This makes setup_context and non-setup_context versions diverge in their behavior
- With the non-setup_context version, saved variables are assumed to have the grad_fn of the inputs.
- But now with the setup_context version, we produce an error for this case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97212
Approved by: https://github.com/zou3519
The purpose of this API is to execute a few large components of work:
1) Refactor all the internals of plumbing dynamic dimension information after dynamo to be stateless
2) Decouple allocation controls around dynamic dimensions from verification
3) For (2), for allocation, create an enum that dictates whether we are in DUCK (default today), STATIC (aka assume_static_default in the past), or DYNAMIC (aka user constrained, do not duck shape) mode; see the sketch after this list
4) For (2), for verification, we separate out the list of dynamic ranges entirely from allocation. This means shape_env does not track what we verify on; instead, it is the caller's job to invoke produce_guards() with the various things they want verified, specifically, with the valid ranges. We do use constrain ranges to refine value ranges when doing analysis.
5) We have decided, therefore, as an extension of (4) to double down on "late" checks versus "eager" checks, primarily because the mechanisms for gathering what actually matters happens during guards, and should be a purview of the caller seeking guards, not the shape env. However, for dynamo, these structures are essentially one and the same.
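As referenced in item 3, here is an illustrative sketch of such an allocation-policy enum; the class and member names are assumptions based on the description above, not necessarily what landed:
```python
from enum import Enum, auto

class DimAllocationPolicy(Enum):
    DUCK = auto()     # default: duck shaping, dims with equal sizes share a symbol
    STATIC = auto()   # previously assume_static_default: allocate a constant
    DYNAMIC = auto()  # user constrained: always a fresh symbol, no duck shaping
```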
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96699
Approved by: https://github.com/avikchaudhuri, https://github.com/ezyang
`get_state_dict_type` in FSDP looks for a key called `_optim_state_dict_config` when getting the optimizer state dict config. However, `set_state_dict_type` sets the config at a key called `_optimstate_dict_config`. This looks like a typo.
This fixes the discrepancy, so that when you set the state dict type, it is correctly used.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97110
Approved by: https://github.com/awgu, https://github.com/fegin
This was used to unblock Meta internal use cases where ```torchrec.distributed``` was used; however, it can't be traced by dynamo properly right now.
We were sending the same fix(#90087) several months ago, but was reverted due to ```fbgemm``` conflicts. This PR catches ```Exception``` rather than ```ImportError``` which can handle the conflicts.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97824
Approved by: https://github.com/wconstab
This lets users that are sure they won't use hooks avoid overhead
related to dynamo guards on (assumedly) empty hook dicts on all
nn modules.
Only enable this flag if you are sure you won't change hook-behavior
after compiling. It is ok to register a hook and then compile, if
you promise never to remove/alter the hook. It is also ok to
not register a hook and compile, if you never register a hook later.
Note- this is not the best we can do, and hopefully in the future
we can avoid the need for this option following some of these paths
- make guards fast enough to not be an issue when guarding on hook
dicts
- make a mode where dynamo actually skips tracing __call__ so
hooks are consistently ignored by compiled programs
- use nnmodule versioning so hook changes can be guarded without
explicit hook dict guards
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97830
Approved by: https://github.com/jansel
See https://github.com/pytorch/pytorch/issues/94024. I disabled this test on ASAN a while ago for this exact issue. The issue, unfortunately, was hard to reproduce, and the flaky bot closed it 3 weeks ago. The ASAN job has been hanging flakily since then, i.e. 8313becefa.
I don't want to reopen the issue and forget about it after 2 weeks, so let's disable the test for ASAN and be at peace (for now). Interestingly, there are other tests here also hanging on ASAN, e.g. `test_leaf_variable_sharing`:
```
# See https://github.com/pytorch/pytorch/issues/14997
@unittest.skipIf(TEST_WITH_ASAN,
"non-deterministically hangs with ASAN")
def test_leaf_variable_sharing(self):
```
I suspect that they have the same root cause.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97742
Approved by: https://github.com/clee2000
We previously used the buffer name for the variable containing the randomly generated kernel input in the kernel benchmark. This has a big drawback: the same kernel may be used for different buffers, and if we use the buffer name as the argument name, the kernel source code for different invocations of the kernel will be different. This causes the following downsides:
- compile time will be longer since we cannot reuse compiled kernels due to cache misses
- it causes inconsistent behavior with TORCHINDUCTOR_BENCHMARK_KERNEL enabled or disabled. We may see more kernels (some are essentially duplicated) in the compiled module if TORCHINDUCTOR_BENCHMARK_KERNEL is enabled.
- it obscures some optimization opportunities. E.g., a kernel spending 6% of the time is worth looking at. But if the kernel is called 20 times and now shows up as 20 different kernels, each spending 0.3% of the time, it would be less obvious that we should optimize this kernel.
In this PR, we just use a canonical name like `arg_{i}` rather than the buffer name to avoid all the issues above.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97755
Approved by: https://github.com/jansel
1. Packaging the nvfuser headers to support C++ builds against nvfuser;
2. Moving `#include <torch/csrc/jit/codegen/fuser/interface.h>` from `torch/csrc/jit/runtime/register_ops_utils.h` to `torch/csrc/jit/runtime/register_prim_ops_fulljit.cpp` to avoid a missing header, since pytorch doesn't package `interface.h`;
3. Patching the DynamicLibrary load of nvfuser to leak the handle; this avoids double de-allocation of `libnvfuser_codegen.so`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97404
Approved by: https://github.com/davidberard98
Summary:
This diffs adds a convert_qconv2d_context op, which converts a cpu quantized Conv2dPackedParamsBase object (used by quantized::conv2d) into a vulkan Conv2dPackedContext object.
This op is used in a later diff (D44189363), to do a graph rewrite of quantized conv2d and conv2d_relu ops
Test Plan:
On Mac
```
cd ~/fbsource
buck1 run -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64
```
On Android
```
cd ~/fbsource
buck1 build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_quantized_api_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_quantized_api_test
adb shell "/data/local/tmp/vulkan_quantized_api_test"
```
Reviewed By: SS-JIA
Differential Revision: D41595032
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97714
Approved by: https://github.com/SS-JIA
This is a follow-up to the last PR to greatly simplify the approach. This should be much cleaner.
**Details**
Let `N` denote the number of original parameters flattened into a given flat parameter with `M` extra padding tensors.
- `_numels_with_padding`: length `N + M`
- `_is_padding_mask`: length `N + M`
- `_numels`, `_param_infos`, `_shapes`, `_fqns`, `_param_extensions`: length `N`
`_shard_param_indices` and `_shard_param_offsets` were used to determine (1) if a given original parameter is in the local shard and if so, then (2) what is its offset in the _sharded_ flat parameter, and (3) how many numel are in the _sharded_ flat parameter.
This PR reworks how to achieve (1), (2), and (3) to allow for simplifying the previously mentioned data structures. In particular, it saves one extra tuple `_shard_param_infos: Tuple[_ShardParamInfo, ...]` of length `N` where each `_ShardParamInfo` entry gives exactly the needed info. For example, the offset into the sharded flat parameter is now pre-computed, so we do not need to do `offset = 0; offset += numel_in_shard` over a `for` loop each time now.
For optimizer state dict, `FSDPParamInfo.param_indices` now maps to the indexes with respect to the length `N` data structures, not the length `N + M` ones. The only purpose of `param_indices` is to be able to index into `flat_param._shard_param_infos[i]` to get the contained info to flatten the unsharded original parameter optimizer state and extract the part in the local shard.
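An illustrative sketch of what one `_ShardParamInfo` entry could hold; the field names are assumptions based on the description above:
```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class _ShardParamInfo:
    in_shard: bool                         # (1) does this original parameter intersect the local shard?
    offset_in_shard: Optional[int] = None  # (2) precomputed offset into the *sharded* flat parameter
    numel_in_shard: Optional[int] = None   # (3) how many of its elements landed in the local shard
```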
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97796
Approved by: https://github.com/rohan-varma
Fixes https://github.com/pytorch/pytorch/issues/96794
Sometimes people never update their local `master` branch. Their workflow instead consists of fetching commits from git and directly creating branches off of the remote `master` branch (e.g. via `git checkout -b <mybranch> origin/master`).
For those people, their local `master` is very old and out of date, creating an unreasonably old lint base that tends to catch all sorts of unrelated linter errors.
Anyone with an updated `master` branch will naturally have an updated pointer to the remote `master`, so this change makes lintrunner friendly to both behavior patterns
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97800
Approved by: https://github.com/huydhn
Mainly two fixes:
1. `make_fx` seems to trace through DeviceMesh operations. This commit removes that from the DTensor expanded graph
2. During DTensor expansion, autograd complains about inplace changes on a leaf node. This commit wraps the entire DTensor expansion code with `torch.no_grad()`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97787
Approved by: https://github.com/wanchaol
This reverts commit f3aca45a163cf1aafd4f5fa65a0adce53b33abfa.
Reverted https://github.com/pytorch/pytorch/pull/97212 on behalf of https://github.com/soulitzer due to TestAutogradFunctionCUDA.test_function_returns_input_inner_requires_grad_True_save_for_vjp_save_tensors_output_mark_dirty_True_cuda leaks
Summary:
Extra C binding module for flatbuffer was introduced because
not all dependencies of Pytorch want (or can) bundle in flatbuffer.
However, flatbuffer is in by default now so this separate binding is not longer needed.
Test Plan: existing unit tests
Differential Revision: D44352583
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97476
Approved by: https://github.com/dbort
Summary: to prepare for further AOT Inductor changes
### <samp>🤖 Generated by Copilot at 7dff885</samp>
This pull request adds support for AOT compilation and C++ wrapper code generation for inductor models. It modifies the `GraphLowering` class in `torch/_inductor/graph.py` and the `compile_fx` function in `torch/_inductor/compile_fx.py` to enable this feature.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97709
Approved by: https://github.com/jansel
Previously we only plotted memory if it was allocated or freed while
trace recording was active. This change also adds any pre-existing blocks
to the visualization. This helps because it is common to enable trace recording
later and then not realize that there is a lot of allocated memory in
the trace even though a lot was allocated beforehand.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97590
Approved by: https://github.com/eellison
Changes to `_native_batch_norm_legit` and `upsample_nearest2d` in `serialized_shape_function_registry.cpp` are made just because this file is auto-generated, and the file was not auto-generated after the changes in `_shape_functions.py` for those two ops.
Signed-Off By: Vivek Khandelwal <vivek@nod-labs.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93919
Approved by: https://github.com/davidberard98
**Summary**
Enable quantization and lowering of `ConvTranspose3d`.
Add test cases for `ConvTranspose1d`, `ConvTranspose2d` and `ConvTranspose3d` since there were no such test cases.
**Test plan**
python test/test_quantization.py -k test_conv_transpose_not_reference
python test/test_quantization.py -k test_conv_transpose_reference
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97125
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
Fixes https://github.com/pytorch/pytorch/issues/96887
We error out in BOTH the case when graph is created and when it is not created.
Still bc-breaking, but not as severe because we are limiting to the case where someone uses setup_context.
This makes setup_context and non-setup_context versions diverge in their behavior
- With the non-setup_context version, saved variables are assumed to have the grad_fn of the inputs.
- But now with the setup_context version, we produce an error for this case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97212
Approved by: https://github.com/zou3519
V.graph.constants like seed_cuda_0 are not handled properly in the wrapper. Recently we moved the code that initializes constants from the global scope to a function. That makes assigning to seed_cuda_0 create a new local variable rather than set up the global variable.
Add 'global var_name' lines to maintain the same behavior as before.
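A minimal sketch of the Python scoping behavior at play (the name `seed_cuda_0` here only mirrors the example above; this is not the generated wrapper code):
```python
seed_cuda_0 = None  # module-level constant slot

def init_constants_without_global():
    seed_cuda_0 = 42  # creates a new *local* variable; the module-level one is untouched

def init_constants_with_global():
    global seed_cuda_0  # the injected 'global var_name' line
    seed_cuda_0 = 42    # now assigns the module-level variable

init_constants_without_global()
print(seed_cuda_0)  # None
init_constants_with_global()
print(seed_cuda_0)  # 42
```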
Test:
Run the forward graph for nvidia_deeprecommender's training run. It previously failed and now passes with the fix.
Thanks @ngimel for reporting the issue with a repro and @Chillee for pointing out the root cause.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97571
Approved by: https://github.com/ngimel
This function is needed by all ReadPlanner subclasses that are trying to implement support for a custom distributed tensor.
Better expose it than have users reimplement this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97570
Approved by: https://github.com/wz337
This PR adds intra-`FlatParameter` 16-byte alignment padding to the `use_orig_params=True` code path to avoid clones in TorchInductor.
**Approach**
The `FlatParameter` maintains several data structures about its original parameters. Notably, the data structures `_param_infos`, `_shapes`, `_numels`, and `_fqns` have the same length and index in the same way.
This PR treats alignment padding _like_ an original parameter in that the padding gets flattened into the `FlatParameter`. Therefore, it must be reflected in the aforementioned data structures. However, given the way in which the data structures are used, we choose to do the following if the `i`th tensor flattened into the `FlatParameter` is padding:
- `_numels[i]` is the numel of padding
- `_param_infos[i] == _shapes[i] == _fqns[i] == None`
This choice is because (1) we must record the padding numel to account for it (e.g. for views) and (2) we prefer to preserve the invariant that the data structures index in the same way over avoiding `None` entries.
To ease the burden of other FSDP developers, we separate the parameter flattening logic:
- `_init_flat_param_and_metadata()`: This should be called only once in the `FlatParamHandle` constructor. The `FlatParameter` metadata is assumed to be static thereafter.
- `flatten_tensors()` / `flatten_tensors_into_flat_param()`: These can be used for optimizer and model state dict and can be called after construction time.
This separation allows `_init_flat_param_and_metadata()` to contain the much heavier metadata logic, while keeping the latter methods much lighter. The only constraint is that the alignment padding logic must be kept consistent between the two, but this should be worth the simpler interface.
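To illustrate the bookkeeping, here is a small hypothetical sketch of how padding entries could sit alongside real parameters in the parallel metadata lists (the names mirror the description above; this is not FSDP code):
```python
# Hypothetical flat-parameter metadata with alignment padding recorded as entries.
params = [("layer.weight", (3, 5)), ("layer.bias", (3,))]
align = 4  # pretend 4 fp32 elements == 16 bytes of alignment

_param_infos, _shapes, _numels, _fqns = [], [], [], []
offset = 0
for fqn, shape in params:
    numel = 1
    for d in shape:
        numel *= d
    _param_infos.append(fqn)
    _shapes.append(shape)
    _numels.append(numel)
    _fqns.append(fqn)
    offset += numel
    pad = (-offset) % align
    if pad:
        # Padding is treated "like" a parameter: its numel is recorded,
        # but the other metadata entries are None.
        _param_infos.append(None)
        _shapes.append(None)
        _numels.append(pad)
        _fqns.append(None)
        offset += pad

# All four lists stay the same length and index the same way.
print(list(zip(_fqns, _shapes, _numels)))
```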
**Testing**
- This PR directly modifies the `use_orig_params=True` code path, so all existing tests passing gives good signal.
- Some existing unit tests had to be adjusted to account for the alignment padding.
- This PR adds two tests in `test_fsdp_flatten_params.py` to explicitly test the sharding metadata with alignment for both parameter full precision and mixed precision since the latter requires possibly more padding elements due to the decreased per-element size.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97667
Approved by: https://github.com/rohan-varma
This is an easy PR. It has some remaining local changes that I had that I felt clarified naming.
- `_param_fqns` -> `_param_name_infos` since it returns a tuple of `fqn, param_name, module_name`, not only `fqn`. (similarly for `_shared_param_fqns` -> `_shared_param_name_infos`)
- nit: `parameter_module_names` -> `param_module_names` for consistency since we almost never fully spell out `parameter`. (similarly for `shared_parameter_module_names` -> `shared_param_module_names`)
- nit: `full_fqn` -> `fqn_from_global_root`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97666
Approved by: https://github.com/rohan-varma
From our recent experience, we refer to FSDP's `FlatParameter` as "flat parameter", not "flattened parameter". This PR renames that in `flat_param.py`.
**This PR only changes documentation.**
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97661
Approved by: https://github.com/rohan-varma
A max-autotune log like
```
AUTOTUNE bias_addmm(512x197951, 512x512, 512x197951)
triton_mm_61 1.2882s 100.0%
triton_mm_62 1.3036s 98.8%
bias_addmm 1.4889s 86.5%
triton_mm_60 1.6159s 79.7%
triton_mm_63 1.7060s 75.5%
triton_mm_64 1.7777s 72.5%
triton_mm_67 1.9722s 65.3%
addmm 2.0603s 62.5%
triton_mm_70 2.0675s 62.3%
triton_mm_68 2.3552s 54.7%
SingleProcess AUTOTUNE takes 2.949904441833496 seconds
```
is confusing since the sum of the runtimes of all the kernels is larger than the total time used for tuning. In fact, `triton.testing.do_bench` returns time in milliseconds rather than seconds. Fix the typo in the log message to make that clear.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97486
Approved by: https://github.com/ngimel, https://github.com/jansel
Summary: Fixes broadcasting along the channel and batch dimensions in quantized add, sub, mul and div
Test Plan:
```
buck run --target-platforms ovr_config//platform/macos:arm64-fbsource -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64
```
Reviewed By: SS-JIA
Differential Revision: D44359706
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97554
Approved by: https://github.com/SS-JIA
We currently use `bitonicSortKVInplace` for sorts of size `n <= 32`
but use `radixSortKVInplace` for `32 < n <= 4096`. Bitonic sort is
also unstable, which forces stable sorts to fall back to a slower path that is up to
4x slower in this small regime.
This PR adds a new kernel `warpMergeSortKVInplace` using
`cub::WarpMergeSort` to implement sorts with `32 < n <= 128` and all
stable sorts with `n < 128`. This results in up to a 2x speedup for
unstable sorts and up to 15x for stable sorts, depending on the input
geometry.
This also doesn't increase the total number of kernels since we are
replacing radix-sorts of size 32 and 128.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96223
Approved by: https://github.com/ngimel
Use `append_cxx_flag_if_supported` to determine whether or not `-Werror` is supported
Do not suppress deprecation warnings if glog is not used/installed; as the check is written right now, it will suppress deprecations even if `glog` is not installed.
Similarly, do not suppress deprecations on MacOS simply because we are compiling with protobuf.
Fix deprecation warnings in:
- MPS by replacing `MTLResourceOptionCPUCacheModeDefault`->`MTLResourceCPUCacheModeDefaultCache`
- In GTests by replacing `TYPED_TEST_CASE`->`TYPED_TEST_SUITE`
- In `codegen/onednn/interface.cpp`, by passing `Stack` by reference rather than by pointer.
Do not guard calls to `append_cxx_flag_if_supported` with `if(CLANG)` or `if(GCC)`.
Fix some deprecated calls in `Metal`; hide more complex exceptions under `C10_CLANG_DIAGNOSTIC_IGNORE`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97584
Approved by: https://github.com/kit1980
This upload a record to a new Rockset `merges` collection in `commons` workspace in the following format:
```
{
"id": comment_id,
"pr_num": pr_num,
"owner": owner,
"project": project,
"pending_checks": pending_checks, # At the time of the merge
"failed_checks": failed_checks, # At the time of the merge
"is_failed": is_failed, # This is set to True if the merge fails to get through for whatever reason
"dry_run": dry_run,
"skip_mandatory_checks": skip_mandatory_checks,
"ignore_current": ignore_current,
"error": error, # The same Exception message that will be shown on PR
}
```
To achieve this, I need to tweak `find_matching_merge_rule` a bit to return the list of pending and failed checks in addition to the matching merge rule. As this function is also used internally, I have confirmed that the internal call doesn't need the return values. Thus, the change is safe to land.
### Testing
* Unit testing
* Dry-run locally `python3 .github/scripts/trymerge.py --comment-id 1478678477 --dry-run 97293` using an older PR. The merge obviously failed, but the record was created successfully on Rockset
```
{
"_id": "52d3152b-ec35-4b5a-91fc-0e7298fc54b5-1",
"_event_time": "2023-03-23T21:10:32.754368Z",
"_meta": null,
"owner": "pytorch",
"is_failed": true,
"id": 1478678477,
"failed_checks": [],
"dry_run": true,
"error": "Command `git -C pytorch cherry-pick -x cc0d2e0fba648bb5deda34a9056f2c4192b22314` returned non-zero exit code 1...",
"ignore_current": false,
"project": "pytorch",
"pr_num": 97293,
"skip_mandatory_checks": false,
"pending_checks": []
}
```
* Dry-run locally with this PR `python3 .github/scripts/trymerge.py --comment-id 1481949104 --dry-run --force 97471` with `--force`
```
{
"_id": "dd7d2580-f6e5-47e7-9441-17df86056c14-1",
"_event_time": "2023-03-23T21:43:53.915911Z",
"_meta": null,
"owner": "pytorch",
"is_failed": true,
"id": 1481949104,
"failed_checks": [],
"dry_run": true,
"error": "PR #97471 has not been reviewed yet",
"ignore_current": false,
"project": "pytorch",
"pr_num": 97471,
"skip_mandatory_checks": true,
"pending_checks": []
}
```
* Dry-run locally with this PR `python3 .github/scripts/trymerge.py --comment-id 1481949104 --dry-run 97471` again with approval rule commented out
```
{
"_id": "5d7de4e3-1af1-4869-a3b7-d1a9dbced6ce-1",
"_event_time": "2023-03-24T00:10:41.914111Z",
"_meta": null,
"is_failed": false,
"id": 1481949104,
"failed_checks": [],
"error": "",
"last_commit_sha": "4657400513f0360a0a4f73d46e1aff0882221687",
"merge_commit_sha": "416bac5b813a181753afade781ae30f4f0843586",
"ignore_current": false,
"pending_checks": [
[
"pull / linux-focal-py3.8-gcc7 / test (default, 1, 3, linux.2xlarge)",
"https://github.com/pytorch/pytorch/actions/runs/4506464828/jobs/7933518379",
12239935788
],
...
[
"trunk / linux-bionic-cuda11.8-py3.10-gcc7 / test (default, 5, 5, linux.4xlarge.nvidia.gpu)",
"https://github.com/pytorch/pytorch/actions/runs/4506465633/jobs/7933621958",
12240067113
],
...
],
"owner": "pytorch",
"skip_mandatory_checks": true,
"author": "Huy Do <huydhn@gmail.com>",
"project": "pytorch",
"merge_base_sha": "a3b30c5025e3381022fa00b127b0d881f4ef66d4",
"pr_num": 97471
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97471
Approved by: https://github.com/clee2000
Updates flake8-comprehensions in lintrunner so we can enforce new checks that have been implemented since the last update (including one implemented by me). I also added C417 to the flake8 ignore codes for now since we do not yet conform to that check.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97671
Approved by: https://github.com/ezyang, https://github.com/malfet
Currently if `setuptools<49.4.0` and there is a minor version mismatch `_check_cuda_version` fails with a misleading non-actionable error:
```
2023-03-24T20:21:35.0625644Z RuntimeError:
2023-03-24T20:21:35.0628441Z The detected CUDA version (11.2) mismatches the version that was used to compile
2023-03-24T20:21:35.0630681Z PyTorch (11.3). Please make sure to use the same CUDA versions.
```
This condition shouldn't be failing since minor version match isn't required.
It fails because the other condition to have a certain version of `setuptools` isn't met. But that condition is written in a comment (!!!). So this PR changes it to actually tell the user how to fix the problem.
While at it, I adjusted the version number as a lower `setuptools>=49.4.0` is sufficient for this to work.
Thanks.
p.s. this problem manifests on `nvidia/cuda:11.2.2-cudnn8-devel-ubuntu20.04` docker image.
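A rough sketch of the intended compatibility rule (a minor version mismatch should not be an error); this is illustrative, not the actual code in `torch.utils.cpp_extension`:
```python
def cuda_major(version: str) -> int:
    return int(version.split(".")[0])

def versions_compatible(detected: str, compiled: str) -> bool:
    # e.g. detected CUDA 11.2 vs PyTorch compiled with 11.3 should be accepted
    return cuda_major(detected) == cuda_major(compiled)

print(versions_compatible("11.2", "11.3"))  # True
print(versions_compatible("10.2", "11.3"))  # False
```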
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97602
Approved by: https://github.com/ezyang
## Description
Currently, both inference and training will use `forward_training` in the rnn primitive, which brings a performance downgrade for inference (the performance drop is from the rnn primitive and unnecessary creation of `pd` and `workspace`). This PR splits them into `forward_inference` and `forward_training` separately.
## Performance
With this fix, RNN-T inference time is reduced by 167 ms, which improves E2E time by `3.7%`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96736
Approved by: https://github.com/jgong5
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 59a5205</samp>
This pull request refactors the namespace declarations in several files under `aten/src/ATen/native/sparse` to use a more concise and consistent syntax. This improves the readability and reusability of the sparse tensor operations code.
Also, do not rely on deprecated `TensorBase::data` and instead use `TensorBase::data_ptr`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97581
Approved by: https://github.com/kit1980, https://github.com/huydhn
This has been bugging me for a while as I'm working on these Python scripts and they are not tracked by the ufmt linter. So I add these scripts to that linter.
```
[[linter]]
code = 'UFMT'
include_patterns = [
'.github/**/*.py',
'test/run_test.py',
```
This change should just work and not break anything as ufmt (black + usort) linter is very safe to use for standalone util scripts.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97588
Approved by: https://github.com/kit1980
This commit adds an entry point for full `train_step` tracing and
expansion. Model forward, backward, and optimizer step will be included
in one graph. DTensor expansion will be applied on top to insert
collective communications. Users can also provide an `Override`
implementation to skip non-traceable submodules and directly install
submodule logic to the DTensor-expanded graph by inserting `fx.Nodes`.
Differential Revision: [D44325177](https://our.internmc.facebook.com/intern/diff/D44325177)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97416
Approved by: https://github.com/yifuwang, https://github.com/wanchaol
**Summary**
Enable the lowering path from a quantized 2.0 fx graph into Inductor. The basic usage will be
```
export_module, guards = torchdynamo.export(m, *args)
prepare_module = prepare_pt2e(export_module, *args)
convert_module = convert_pt2e(prepare_module)
optimized_module = compile_fx(convert_module, example_inputs)
```
Most of the issues we met previously have already been fixed in PyTorch master. So in this PR, we mainly do 2 things:
1. Add the basic usage into a UT.
2. Move `handle_dynamo_export_graph` before the fusion passes, otherwise the dynamo_export_graph will hit the fusion passes twice, which is unexpected.
**Test Plan**
```
clear && python -m pytest test_quantization.py -k test_inductor_backend_config_conv
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96927
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/jansel, https://github.com/jerryzh168
Previously, `_need_to_materialize_module` would return false because:
* `managed_params =_get_orig_params(module, ignored_params)` returns a generator
* `is_meta_module = any(param.is_meta for param in managed_params)` exhausts the generator in its check
* `any(fake.is_fake(param) for param in managed_params)` would try to iterate over the empty generator and get an empty sequence, thus returning `False`
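A minimal sketch of the pitfall (plain Python, not the FSDP code): a generator can only be consumed once, so the second `any()` sees nothing.
```python
def get_params():
    yield from [1, 2, 3]  # stand-in for _get_orig_params(...)

managed_params = get_params()               # a generator, not a list
print(any(p > 5 for p in managed_params))   # False, and the generator is now exhausted
print(any(p > 0 for p in managed_params))   # False (!) even though 1, 2, 3 would match

managed_params = list(get_params())         # materializing the params fixes both checks
print(any(p > 5 for p in managed_params), any(p > 0 for p in managed_params))
```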
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97488
Approved by: https://github.com/ngimel, https://github.com/awgu
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at b07152e</samp>
This pull request refactors the CMake configuration to enable the `USE_FLASH_ATTENTION` feature for the `torch_cuda` target only, using a target-specific macro. This avoids conflicts with other libraries that also use this feature, such as fairseq.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97579
Approved by: https://github.com/kit1980
Fixes #97191
This PR aims to propagate collective exceptions (async error or timeout) up to the program, so as to avoid silent stuck job.
### Previous output in #97191
```
Rank 0 is the problematic rank
Rank 4 completed
Rank 5 completed
Rank 3 completed
Rank 6 completed
Rank 2 completed
Rank 7 completed
Rank 1 completed
[E ProcessGroupNCCL.cpp:464] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, Timeout(ms)=10000) ran for 10917 milliseconds before timing out.
Rank 0 completed
[E ProcessGroupNCCL.cpp:478] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:483] To avoid data inconsistency, we are taking the entire process down.
```
Although it says that it is taking the process down, it sometimes fails to do so.
### New output after this PR:
```
...
[E ProcessGroupNCCL.cpp:459] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, Timeout(ms)=10000) ran for 10599 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:473] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:479] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:818] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, Timeout(ms)=10000) ran for 10599 milliseconds before timing out.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 194470) of binary: /data/home/kw2501/repos/pytorch-dev-env/bin/python
Traceback (most recent call last):
File "/pytorch-dev-env/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch', 'console_scripts', 'torchrun')())
File "/pytorch-dev/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/pytorch-dev/torch/distributed/run.py", line 794, in main
run(args)
File "/pytorch-dev/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/pytorch-dev/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/pytorch-dev/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
hang.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-03-20_22:00:42
host : node0
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 194470)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 194470
============================================================
```
The log suggests that TorchX monitor is triggered, and job is torn down.
### Major changes in this PR:
1. Merge ncclWatchDog thread and workCleanupLoop thread into one so that the watch action and the throw action are streamlined.
Previously, ncclWatchDog was responsible for watching comm errors and timeouts, and workCleanupLoop was responsible for watching Work item errors and throwing exceptions. This two-thread design is not streamlined, raising the chance of missing the throw. Also, it is duplicative to watch at multiple levels.
2. Rethrow exception at watchdog thread.
3. Clean up a bunch of duplicated functions, e.g. `checkAndThrowException` and `handleNcclException`.
4. Turn on ASYNC_ERROR_HANDLING by default
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97066
Approved by: https://github.com/rohan-varma
Summary:
This fixes an issue raised in [is_causal parameter in torch.nn.TransformerEncoderLayer.forward does not work #96941](https://github.com/pytorch/pytorch/issues/96941) where results computed with is_causal do not properly reflect causal masking.
In PyTorch 2.0, Accelerated PT Transformers added the is_causal parameter to legacy nn.Transformer* and nn.MHA APIs aligned with and intended to engage the is_causal parameter of the new scaled_dot_product_attention (SDPA) operator.
At present is_causal works differently for Transformer* modules, the nn.MHA and F.MHA:
* The nn.Transformer* modules treat is_causal as an optional indicator about the format of attn_mask. This is because some layers (such as the CLIP layer) use the attention mask in the layer, and thus attn_mask was a required feature.
* Initially, nn.MHA and F.MHA were defined to align with F.SDPA in behavior: a user may specify either the attention mask, or is_causal, but not both. It seemed to make sense at the time to align SDPA and MHA, esp since there was a larger overlap of parameters which have since changed, e.g., with the removal of need_weights from SDPA. (See below for why this makes sense.)
Unfortunately, this does not work because of how MHA was changed to handle the need_weights parameter. When need_weights is present, we do not (any more) call SDPA because support for need_weights was removed from SDPA before the release. The rationale is that need_weights defeats all optimization at the foundation of SDPA performance. Having the flag might thus mislead users into thinking they get good performance and have them disappointed when they enable a legacy feature of MHA which massively degrades performance. (They might not think anything of enabling that, because it is on by default in MHA today, which leads to more issues.)
Since SDPA does not (no longer) support need_weights, we need to pick a separate path which implements attention using a set of discrete operations that allocates a tensor for weights. Alas, this code path does not have support for is_causal, because attention is implemented as matmul and using the attention mask. Thus, is_causal has no impact. (A substantially similar situation arises with how kpm is implemented today because Nested Tensors are not supported by torch.compile() in 2.0)
This problem was masked because all uses of legacy nn.MHA (and F.MHA) come through nn.Transformer* which called self-attention (i.e., nn.MHA) only ever with the attention mask attn_mask, and never with is_causal, a missed optimization opportunity that would have been addressed in a future performance update.
Regrettably, always calling nn.MHA with attn_mask prevented diagnosing of the issue of not having a suitable attention mask when need_weights support was dropped from SDPA and a discrete implementation of attention was added for that scenario, and for the execution path with key_padding_mask.
We have two options to address this issue:
Solution 1: Whenever nn.MHA and F.MHA are executed with is_causal set, we internally create a causal mask at significant expense of allocating a tensor and filling it with a triangular causal matrix. This increases memory usage, and runtime, for allocating a causal mask. To add insult to injury, in all current (and likely future) execution scenarios, MHA is called by a model using the nn.Transformer API which already has that matrix and passes it from nn.module to nn.module. Then the passing in of attn_mask has to be suppressed by nn.TransformerEncoderLayer, only for nn.MHA to immediately allocate the very same tensor again to satisfy the requirement to have an attention mask for the computation. (We expect new use cases to use SDPA directly.)
Solution 2: We align the behavior of nn.MHA and F.MHA with the rest of the existing nn.Transformer API, and require the attention mask to be passed into nn.MHA in addition to is_causal as an optional indicator about the nature of the attention mask rather than as an alternative to attn_mask. Then, when we choose the code path for processing MHA with need_weights or a key_padding_mask, we have the attn_mask passed down through the nn.Transformer* hierarchy, without the added overhead of allocating an attention mask as in scenario 1.
This PR implements solution 2 which offers better performance and in retrospect aligns MHA better with the rest of the Transformer modules as the definition of SDPA evolved into a more streamlined high-performance operator. It ostensibly changes how is_causal works, by requiring the attention mask to be specified. However, as described here, and as shown in the submitted issue, is_causal is not working as intended today, so it requires a change regardless.
In that sense, a change in API does not occur per-se, as the current implementation is not working, and a change has to occur either way to resolve the submitted issue, breaking any use cases that depend on the current implementation. Checks exist (and more can be added) that flag any scenarios where is_causal is passed as True, but no attention mask is provided, ensuring that there's no quiet change from even the faulty behavior present in 2.0.
As an upside, the present implementation will improve performance by addressing the passing of the is_causal flag from Transformer modules to MHA, speeding up training for these examples, e.g., finetuning BERT, RoBERTa, XLM-R models.
Differential Revision: D44245725
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97214
Approved by: https://github.com/albanD
We had some minimal tests for `torch.testing.make_tensor` before, but nothing exhaustive. This lead to quite few edge cases being undetected. This PR adds comprehensive tests and leaves a few FIXMEs in there for behavior that needs to be fixed in `make_tensor`. This will happen in later commits of this stack. Meaning, at the end of this stack, there shouldn't be any FIXME left in the tests added here.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96331
Approved by: https://github.com/mruberry
data type: float32
Input size: torch.Size([64, 4, 128, 128])
single socket (32 cores):
```
Before: bernoulli 0.001327775239944458 s dropout 0.0014216173489888509 s
After: bernoulli 0.0002424612840016683 s dropout 0.00039757410685221353 s
```
single core:
```
Before: bernoulli 0.04154032731056213 s dropout 0.04382548745473226 s
After: bernoulli 0.006143261671066284 s dropout 0.0065830423831939695 s
```
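For reference, a rough sketch of the kind of micro-benchmark behind these numbers (this is not the actual benchmark script; core/socket pinning is omitted):
```python
import time
import torch
import torch.nn.functional as F

x = torch.randn(64, 4, 128, 128)  # float32, matching the input size above

def bench(fn, iters=100):
    # simple wall-clock timing per call
    start = time.time()
    for _ in range(iters):
        fn()
    return (time.time() - start) / iters

print("bernoulli:", bench(lambda: torch.empty_like(x).bernoulli_(0.5)), "s")
print("dropout  :", bench(lambda: F.dropout(x, p=0.5, training=True)), "s")
```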
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97002
Approved by: https://github.com/jgong5, https://github.com/jansel
Fixes #96813.
Comments:
1. Wasn't able to test since tools/nightly.py does not allow for GPU build (and I don't want to build from scratch).
2. In theory, the bug (i.e. NaNs) can still occur when beta is very small (e.g. `beta=1e-50`), but not sure whether anybody cares.
3. Some checks within the smooth_l1_loss C++ code could be changed to check for `beta > 0` instead of `beta >= 0`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97022
Approved by: https://github.com/jbschlosser
Summary:
- Importing torch on Windows can cause a crash within python.
- The problem was introduced by the change in `Module.cpp` from https://github.com/pytorch/pytorch/pull/94927
- The cause is that a call to `PyObject* initModule(void)` declared with a `__declspec(dllimport)` specifier can lead to a crash if the definition doesn't include the `__declspec(dllexport)` counterpart.
- To mitigate the problem without introducing customized macros and changing the build system (note, `#include <c10/macros/Export.h>` doesn't work in `stub.c`) is to simply remove the `__declspec(dllimport)` specifier.
- According to https://web.archive.org/web/20140808231508/http://blogs.msdn.com/b/russellk/archive/2005/03/20/399465.aspx and other sources, `__declspec(dllimport)` only leads to some code optimizations, and since `initModule()` is only called once at startup, this is marginal.
- Note: the `stub_with_flatbuffer.c` file counterpart wasn't affected, therefore, not touched.
Differential Revision: D44236183
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97193
Approved by: https://github.com/ezyang
backport std::ssize to c10
Summary:
Now that we have -Werror=sign-compare enabled, we encounter a lot of
friction comparing standard containers and our tensors which are
signed.
std::ssize will make it easier and safer to succinctly convert
container sizes to a signed type.
Test Plan: Added a unit test.
Reviewers: ezyang
Subscribers:
Tasks:
Tags:
---
Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/pytorch/pytorch/pull/97442).
* #97443
* __->__ #97442
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97442
Approved by: https://github.com/ezyang
Fixes #97382. #95416 fixed a critical bug in the dynamo benchmark, where AMP tests fell back to eager mode before that PR. However, after that PR, we found [a list of TIMM models amp + eager + training testing failed](https://docs.google.com/spreadsheets/d/1DEhirVOkj15Lu4UNawIUon9MqkVLaWqyT-DQPif5NHk/edit#gid=0).
Now we identified the root cause is: high loss values make gradient checking harder, as small changes in accumulation order upset accuracy checks. We should switch to the helper function ```reduce_to_scalar_loss``` which has been used by Torchbench tests.
After switching to ```reduce_to_scalar_loss```, the TIMM models accuracy pass rate grows from 67.74% to 91.94% in my local test. The remaining 5 failing models (ese_vovnet19b_dw, fbnetc_100, mnasnet_100, mobilevit_s, sebotnet33ts_256) need further investigation and handling, but I think it should be a similar reason.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97423
Approved by: https://github.com/Chillee
remove unused CAFFE2_VERSION macros
Summary:
Nothing reads these and they are completely subsumed by TORCH_VERSION.
Getting rid of these will be helpful for build unification, since they
are also not used internally.
Test Plan: Rely on CI.
Reviewers: sahanp
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97337
Approved by: https://github.com/malfet
~~Need https://github.com/microsoft/onnx-script/pull/484~~
Support dynamic export on the fx-ONNX exporter. Essentially, we set input sizes and nodes all dynamic in TorchScript, and leverage `aten::sym_size` to catch dynamic sizes between each op.
1. Add a `dynamic_axes` switch between symbolic tracing (dynamic sizes) and fake mode (static). Set it to True by default, as most of our tests are happy with symbolic tracing. Except GPT2 stays with fake mode with error: https://github.com/microsoft/onnx-script/issues/523
2. Add test_fx_dynamic_onnruntime.py to test some ad hoc cases we have from the old exporter. This can be removed once tests are integrated with https://github.com/pytorch/pytorch/pull/96479
3. Since `aten::sym_size` is operated with a built-in function, a built-in function mapping is added to support SymFloat/SymInt. (FIXME: https://github.com/pytorch/pytorch/issues/97201). The sym_size output value is also an fx.Node, and can be found in `fx_name_to_onnxscipt_value`, so its operation stays the same as other ONNX ops in the ONNX graph.
4. Fully deprecated FakeTensorProp as make_fx() should provide all node meta info.
5. Put complicated fx.Node related ArgumentType into _type_utils.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96350
Approved by: https://github.com/wschin, https://github.com/justinchuby
Previously, we would use the same memory pool but not actually reuse the same memory. The peak memory showed good numbers, but real memory use was much higher because we had a bunch of unallocated segments that could not be reused.
As stated in comments:
NB: cuda caching allocator will remember the stream a segment is allocated to
and only allocate that segment to the same stream. we need to use a single stream
for all allocations to the memory pool, otherwise the allocations to separate streams
will not be reused; separate recordings would have used the same memory pool, but not
the same memory.
Thanks to @zdevito for help debugging this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97419
Approved by: https://github.com/ngimel
Twice this week I have had people confuse "operator defined with Python
operator registration aka torch.library" and "PyOperator which is used
to define control flow operators and other operators that cannot be
represented in JIT schema." Renaming PyOperator for clarity.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97493
Approved by: https://github.com/SherlockNoMad
Summary:
Previous diff D43068669 introduced channel padding, and in doing so, it broke the quantized copy of cpu to vulkan tensors.
This diff updates the quantized nchw to image shaders, in order to work with padded channels.
Test Plan:
```
buck run --target-platforms ovr_config//platform/macos:arm64-fbsource -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64
```
Differential Revision: D44309956
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97372
Approved by: https://github.com/SS-JIA
stringstream construction is expensive, and we can exactly reserve space for the output string while doing the same number of string copies. (If we wanted to improve performance further, we could introduce annotation_str_out to append the output to a given std::string and thus avoid copying subtype annotation strings. It occurs to me that the existing approach is quadratic in the number of layers of nesting, so we should probably do this!)
Differential Revision: [D43919651](https://our.internmc.facebook.com/intern/diff/D43919651/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96498
Approved by: https://github.com/Skylion007
stringstream is expensive to create, so we use ostringstream instead of stringstream, and we can easily specialize the empty tuple. Also, anybody compiling with C++20 support can move out of the stringstream, and it shouldn't hurt people without C++20 support to do so. I would consider specializing the 1-element case as well, but I don't have evidence that that's necessary right now.
Differential Revision: [D43882402](https://our.internmc.facebook.com/intern/diff/D43882402/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96497
Approved by: https://github.com/Skylion007
This is needed for the HSTU model.
Details:
* ~~NT `chunk` now calls into NT `split_with_sizes` since the latter is more general~~ (removed; they're totally separate)
* Throws for backward
* Only operates over the last dim (`dim=-1`)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97446
Approved by: https://github.com/cpuhrsch
Fixes #97191
This PR aims to propagate collective exceptions (async error or timeout) up to the program, so as to avoid silent stuck job.
### Previous output in #97191
```
Rank 0 is the problematic rank
Rank 4 completed
Rank 5 completed
Rank 3 completed
Rank 6 completed
Rank 2 completed
Rank 7 completed
Rank 1 completed
[E ProcessGroupNCCL.cpp:464] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, Timeout(ms)=10000) ran for 10917 milliseconds before timing out.
Rank 0 completed
[E ProcessGroupNCCL.cpp:478] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:483] To avoid data inconsistency, we are taking the entire process down.
```
Although it says that it is taking the process down, it sometimes fails to do so.
### New output after this PR:
```
...
[E ProcessGroupNCCL.cpp:459] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, Timeout(ms)=10000) ran for 10599 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:473] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:479] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:818] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, Timeout(ms)=10000) ran for 10599 milliseconds before timing out.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 194470) of binary: /data/home/kw2501/repos/pytorch-dev-env/bin/python
Traceback (most recent call last):
File "/pytorch-dev-env/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch', 'console_scripts', 'torchrun')())
File "/pytorch-dev/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/pytorch-dev/torch/distributed/run.py", line 794, in main
run(args)
File "/pytorch-dev/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/pytorch-dev/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/pytorch-dev/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
hang.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-03-20_22:00:42
host : node0
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 194470)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 194470
============================================================
```
The log suggests that TorchX monitor is triggered, and job is torn down.
### Major changes in this PR:
1. Merge ncclWatchDog thread and workCleanupLoop thread into one so that the watch action and the throw action are streamlined.
Previously, ncclWatchDog was responsible for watching comm errors and timeouts, and workCleanupLoop was responsible for watching Work item errors and throwing exceptions. This two-thread design is not streamlined, raising the chance of missing the throw. Also, it is duplicative to watch at multiple levels.
2. Rethrow exception at watchdog thread.
3. Clean up a bunch of duplicated functions, e.g. `checkAndThrowException` and `handleNcclException`.
4. Turn on ASYNC_ERROR_HANDLING by default
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97066
Approved by: https://github.com/rohan-varma
Carried over a comment from the tensor.flatten docstring to clarify when a view vs. copy is instantiated - this has been a [minor point of confusion in forums](https://discuss.pytorch.org/t/what-is-the-difference-of-flatten-and-view-1-in-pytorch/51790/5). This comment is:
```
Unlike NumPy’s flatten, which always copies input’s data, this function may return the original object, a view, or copy.
If no dimensions are flattened, then the original object input is returned.
Otherwise, if input can be viewed as the flattened shape, then that view is returned.
Finally, only if the input cannot be viewed as the flattened shape is input’s data copied.
See torch.Tensor.view() for details on when a view will be returned.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97276
Approved by: https://github.com/mikaylagawarecki
Fixes https://github.com/pytorch/pytorch/issues/82915
This rare flaky issue caught my attention today when it failed flakily on MacOS in https://github.com/pytorch/pytorch/actions/runs/4494182574/jobs/7906827531. The test expected 3 traces to be written but got only 2 of them.
Looking a bit closer into the `tensorboard_trace_handler` function, it looks like there is a potential filename clash here. The millisecond since epoch `"{}.{}.pt.trace.json".format(worker_name, int(time.time() * 1000))` is used as part of the name. As `tensorboard_trace_handler` is used as a callback handle in the test, the names look too close to each other (1-millisecond apart), i.e.
```
huydo-mbp_13494.1679526197252.pt.trace.json
huydo-mbp_13494.1679526197253.pt.trace.json
huydo-mbp_13494.1679526197250.pt.trace.json
```
Switching to nanosecond reduces the chance of two or more of them having the same timestamp while keeping the naming convention intact, i.e. `huydo-mbp_13804.1679526325182878000.pt.trace.json`
I suspect that this is also the cause of Windows flakiness.
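A quick sketch of the difference; the worker name here is just the example taken from the log above:
```python
import time

worker_name = "huydo-mbp_13494"  # example worker name from the log above
ms_name = "{}.{}.pt.trace.json".format(worker_name, int(time.time() * 1000))
ns_name = "{}.{}.pt.trace.json".format(worker_name, time.time_ns())
print(ms_name)  # two handler calls ~1 ms apart can collide on this name
print(ns_name)  # nanosecond resolution makes a collision far less likely
```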
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97392
Approved by: https://github.com/malfet, https://github.com/aaronenyeshi
Summary:
The current `optim_state_dict()` does not require users to call `optim.state_dict()` first, while `optim_state_dict_to_load()` requires users to call `optim.load_state_dict()`. This PR makes both APIs provide the option for users not to have to call the extra API.
This PR also changes the arguments order of `optim_state_dict_to_load` which is a breaking change. So we should do this asap before the API is adopted in production cases.
Test Plan: CI
Differential Revision: D43925068
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96534
Approved by: https://github.com/rohan-varma
We currently use `bitonicSortKVInplace` for sorts of size `n <= 32`
but use `radixSortKVInplace` for `32 < n <= 4096`. Bitonic sort is
also unstable, which forces stable sorts to fall back to a slower path that is up to
4x slower in this small regime.
This PR adds a new kernel `warpMergeSortKVInplace` using
`cub::WarpMergeSort` to implement sorts with `32 < n <= 128` and all
stable sorts with `n < 128`. This results in up to a 2x speedup for
unstable sorts and up to 15x for stable sorts, depending on the input
geometry.
This also doesn't increase the total number of kernels since we are
replacing radix-sorts of size 32 and 128.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96223
Approved by: https://github.com/ngimel
Per title, I suspect that having a leftover PyTorch built from CUDA 11.7 installed in non-ephemeral Windows runners could cause some flakiness on Windows CUDA 11.8 jobs also running on the same type of runners, for example `win-vs2019-cuda11.8-py3` in 5d3c347bf6 failed with a PATH error:
```
nvrtc: error: failed to open nvrtc-builtins64_117.dll.
Make sure that nvrtc-builtins64_117.dll is installed correctly.
```
This also cleans up the dead code about `pytorch_env_restore.bat` under the `ci_scripts` temp directory. This directory is always cleaned up by [teardown-win](https://github.com/pytorch/pytorch/blob/master/.github/actions/teardown-win/action.yml#L33), so the bat script will never be there for the next job anyway. Windows test jobs are doing fine, proving that we don't need this ad hoc script anymore.
### Testing
https://github.com/pytorch/pytorch/actions/runs/4485931686/jobs/7888513795
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97285
Approved by: https://github.com/seemethere
Summary: It turns out we never turn on cudnn v8 API which blocks bf16 conv. Enable the new v8 API
Test Plan: buck run mode/dev-nosan scripts/xdwang/example:fc_pytorch
Reviewed By: ngimel
Differential Revision: D43784279
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96512
Approved by: https://github.com/malfet
This refactor should make it easier to add an export hook into aot autograd.
(1) I killed `create_forward_or_joint_functionalized()` (and the functions that it called, like `forward_or_joint()`) which used to handle autograd + functionalization all-in-one-go for the joint case, and was also used in the inference case.
I added a few separate helper functions:
`create_functionalized_graph()`: this takes a flat fn, and returns a functionalized fx graph. It is mostly just a thin wrapper around functionalization + make_fx(), but also has some extra logic to manually append `copy_()` ops to the end of the graph.
`fn_no_extra_mutations()`: this creates the fn that we want to trace in the inference code path. It takes in a function that it then calls, and returns the outputs + any (updated) mutated inputs.
`joint_fn_no_external_mutations()`: this creates the fn that we want to trace in the joint code path. It takes in a function, and traces out its joint. It also does the work of cloning inputs that are mutated and require gradients, returning mutated inputs as outputs, and returning intermediate bases as outputs
We should be able to add an export hook by basically adding a similar version of `joint_fn_no_external_mutations` but with a lot more restrictions (guaranteed to have no tangents, not synthetic bases, etc), and calling `create_functionalized_graph()` on it.
Differential Revision: [D44204090](https://our.internmc.facebook.com/intern/diff/D44204090)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96341
Approved by: https://github.com/ezyang
Why did I choose context manager instead of per-call? Early stopping is not part of the model definition, and depending on how a particular model is used, e.g., with PT2 or not we may or may not want to disable early stopping.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96866
Approved by: https://github.com/albanD
Closes #87365
I added `as_strided_` to the tensor docs, following what seemed to be a pattern consistent with similar functions. More specifically, both the in-place and out-of-place function counterparts are defined in `_tensor_docs.py`, with the in-place version linking to the out-of-place version and the out-of-place version pointing to the corresponding `_torch_docs.py` definition.
If the above is not what we want (e.g. we want to add a more robust description, examples, etc.), let me know and I will be happy to update accordingly!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97300
Approved by: https://github.com/zou3519
Summary:
Calls to this function without an argument will get a stack trace at
import time. This is expensive, we can just skip it by passing in a value.
Test Plan: Wait for tests
Differential Revision: D44244345
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97274
Approved by: https://github.com/kiukchung
Updates:
- ~recommend user to use non-reentrant, mention that reentrant will be deprecated in the future~
- merges all the warnings into a single list of non-reentrant improvements over reentrant
- adds an additional entry to the list about allowing backward inside checkpointed region
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96862
Approved by: https://github.com/albanD
**Summary** NamedTuple attributes can be annotated to declare their type:
```python
class MyNamedTuple(NamedTuple):
x: int
y: torch.Tensor
z: MyOtherType
```
Normally in python you can also declare your types as strings, `x: 'int'`. But NamedTuples previously didn't support this, because their annotation evaluation process was slightly different. This PR updates the NamedTuple attribute type annotation evaluation method to support ForwardRef declarations (i.e. declaring as strings).
**Details**
Below I repeat the comment I left in _jit_internal.py:
NamedTuple types are slightly different from normal types.
Normally, annotations are evaluated like this (during jit.script):
1. Load strings of python code into c++ and parse.
2. Get annotations as strings
3. Use the PythonResolver's resolution callback (rcb) to convert the string into a python object
4. We call into annotations.py:ann_to_type to convert python obj from step 3 into a type that torchscript understands.
NamedTuples are more complicated, because they have sub-types. Normally, once we have the NamedTuple type object from #3, we can just look at the annotation literal values and use ann_to_type directly on them.
But sometimes, users will annotate with string literals, e.g.
```
x: 'int'
```
This also happens with PEP 563 (from __future__ import annotations)
These annotations appear in the annotation dict as ForwardRef('int').
Then, we need to convert the string into a python object. This requires having local context for custom objects or imported types. rcb() is what gives us this. So, we plumb rcb through the stack so it can be used in this context for the if block below.
FAQ:
- Why do we need this special handling for NamedTuple but string annotations work fine for normal types? Normally, we parse the string directly and then call rcb() directly from C++.
- Why not use ForwardRef._evaluate? For that, we need globals() and locals() for the local context where the NamedTuple was defined. rcb is what lets us look up into these. So, basically rcb does the hard work for us.
- What is rcb? rcb is a ResolutionCallback - python callable that takes a string and returns a type. It's generated by `createResolutionCallback.*` in _jit_internal.py.
**Why is this only partial support**:
This only plumbs the rcb through some paths. In particular, the `toSugaredValue` path uses a fake rcb.
**Alternatives**:
We could also treat this the way we treat non-nn.Module classes: we evaluate them separately, ahead of time. That solution is probably better, but probably requires a more risky refactor for the way NamedTuples are handled.
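A minimal sketch of the now-supported pattern, assuming the behavior described above (the class and function names are illustrative, not taken from the test suite):
```python
from typing import NamedTuple
import torch

class Point(NamedTuple):
    x: 'torch.Tensor'  # string (ForwardRef-style) annotations
    y: 'torch.Tensor'

@torch.jit.script
def shift(p: Point) -> Point:
    return Point(p.x + 1, p.y + 1)

print(shift(Point(torch.zeros(2), torch.ones(2))))
```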
Fixes #95858
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96933
Approved by: https://github.com/qihqi
If the Python development library is missing when building PyTorch from source, cmake will raise an error like:
```
CMake Error at cmake/Dependencies.cmake:1079 (if):
if given arguments:
"VERSION_LESS" "3"
Unknown arguments specified
```
This is quite misleading; a user might think it's a syntax error or a cmake version problem.
This PR adds a check to ensure `PYTHONLIBS_VERSION_STRING` exists before using it.
Related #87993
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96642
Approved by: https://github.com/kit1980
Summary: Have minifier include the current buck target as a dependency to make sure all deps are included.
Test Plan: TORCH_COMPILE_DEBUG_DIR=”.” buck2 run mode/dev-nosan //caffe2/test/inductor:minifier_smoke
Differential Revision: D44231209
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97183
Approved by: https://github.com/anijain2305
# Summary
NestedTensors currently don't support non-identical strided addition. When accumulating grad, it is possible to try to accumulate a grad with different striding than the old var, and there is no way to change this in user code. This is a workaround; we should probably support strided addition for NT.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97195
Approved by: https://github.com/albanD, https://github.com/cpuhrsch
The major cost of doing autotuning in a sub-process is process creation and initialization. Previously we did that for each benchmark task. This PR reuses a child process as long as it has not crashed yet. This improves compile time a lot. It's still a bit slower than single-process tuning though. Here is the comparison between single-process tuning and multi-process tuning:
- if a benchmark task will crash the process, then single process tuning is a no-go
- if a benchmark task works fine, then tuning in a child process will be slower. We will try leveraging multi-GPU to further speed this up.
TLDR for the compilation time: we reduce the 11x slowdown to 1.5x. We'll try to further improve that.
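A generic sketch of the "reuse one child process" idea, using a simple queue-based worker (this is illustrative, not the inductor autotuner's actual implementation):
```python
import multiprocessing as mp

def worker(task_q, result_q):
    # stay alive and serve tasks until a None sentinel arrives
    for task in iter(task_q.get, None):
        result_q.put(task * task)  # stand-in for benchmarking one kernel candidate

if __name__ == "__main__":
    task_q, result_q = mp.Queue(), mp.Queue()
    child = mp.Process(target=worker, args=(task_q, result_q))
    child.start()                  # pay the process creation cost once
    for t in range(5):             # many benchmark tasks reuse the same child
        task_q.put(t)
        print(result_q.get())
    task_q.put(None)               # shut the worker down
    child.join()
```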
Here are the compilation time comparison:
Single process autotuning:
```
AUTOTUNE ref_mm_plus_mm(2048x64, 64x1536, 2048x64, 64x1536)
triton_mm_plus_mm_0 0.0276s 100.0%
triton_mm_plus_mm_6 0.0287s 96.4%
triton_mm_plus_mm_5 0.0307s 90.0%
triton_mm_plus_mm_1 0.0317s 87.1%
triton_mm_plus_mm_7 0.0379s 73.0%
ref_mm_plus_mm 0.0389s 71.1%
triton_mm_plus_mm_2 0.0399s 69.2%
triton_mm_plus_mm_3 0.0410s 67.5%
triton_mm_plus_mm_4 0.0410s 67.5%
SingleProcess AUTOTUNE takes 9.04686689376831 seconds
```
Naive multi-process tuning (not reusing the child process): 11x slower than single-process autotuning
```
AUTOTUNE ref_mm_plus_mm(2048x64, 64x1536, 2048x64, 64x1536)
triton_mm_plus_mm_0 0.0287s 100.0%
triton_mm_plus_mm_6 0.0287s 100.0%
triton_mm_plus_mm_1 0.0317s 90.3%
triton_mm_plus_mm_5 0.0317s 90.3%
triton_mm_plus_mm_7 0.0379s 75.7%
ref_mm_plus_mm 0.0389s 73.7%
triton_mm_plus_mm_2 0.0399s 71.8%
triton_mm_plus_mm_3 0.0399s 71.8%
triton_mm_plus_mm_4 0.0420s 68.3%
SubProcess AUTOTUNE takes 101.22216320037842 seconds
```
Multi-process tuning reusing the child process (this PR): 1.5x slower than single-process autotuning
```
AUTOTUNE ref_mm_plus_mm(2048x64, 64x1536, 2048x64, 64x1536)
triton_mm_plus_mm_0 0.0276s 100.0%
triton_mm_plus_mm_6 0.0287s 96.4%
triton_mm_plus_mm_5 0.0307s 90.0%
triton_mm_plus_mm_1 0.0317s 87.1%
triton_mm_plus_mm_7 0.0379s 73.0%
ref_mm_plus_mm 0.0389s 71.1%
triton_mm_plus_mm_2 0.0399s 69.2%
triton_mm_plus_mm_3 0.0410s 67.5%
triton_mm_plus_mm_4 0.0410s 67.5%
SubProcess AUTOTUNE takes 13.752070665359497 seconds
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97219
Approved by: https://github.com/ngimel
exclude all generated files from torch_headers
Summary:
This allows Bazel to build without having to wipe the standard CMake
build.
The standard CMake build produces generated files in the source tree,
this causes a problem because Bazel has its own way of generating
them, and then both sets of generated files conflict with each other.
Test Plan: Rely on CI.
Reviewers:
Subscribers:
Tasks:
Tags:
---
Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/pytorch/pytorch/pull/96956).
* #96957
* __->__ #96956
* #96955
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96956
Approved by: https://github.com/PaliC
Fixes #96347
This PR:
- Makes the functorch tests run as a part of the "default" shards
- Delete the functorch CI shard from all CI job configurations (if it exists)
- Increase the "default" shard count by 1 for each job, unless it was
previously set to 1, to accommodate the new functorch tests and not
regress time-to-signal.
- Adds a bunch of skips for ROCM and torchdynamo configurations. We can
investigate them later.
NB: I might go through some more iterations to figure out what other
skips need to be added, but this iteration of the PR seems to pass most of the CI suite.
Test Plan:
- wait for CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96464
Approved by: https://github.com/huydhn
DTensor submesh support is added in https://github.com/pytorch/pytorch/pull/95458.
This PR adds support for DTensor submesh by adding an extra check when create local save/load plan.
If the rank is not participating in the mesh, we simply skip creating WriteItem/ReadItem for the local SavePlan/LoadPlan.
Updated the associated test as well.
cc. @wanchaol, @kumpera
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96802
Approved by: https://github.com/wanchaol
The purpose of this PR is to remove reliance on argument positions in dedup guards, AND extend the functionality to params.
A version of this PR was stamped prior https://github.com/pytorch/pytorch/pull/95831 - but was kinda gross, because it was based on an underlying PR that did way too much with source names.
This PR leaves most of that alone, in favor of just reusing the same name standardization logic that dynamo module registration does.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96774
Approved by: https://github.com/ezyang
Fixes #95492
**Summary**
This PR fixes the issue that weighted functional ops with kwargs are not lowered correctly since kwargs are ignored.
These kwargs should be moved from the functional op to its corresponding prepack op, e.g., from `F.conv2d` to `quantized.conv2d_prepack`.
**Test plan**
python test/test_quantization.py -k test_lowering_functional_conv_with_kwargs
python test/test_quantization.py -k test_lowering_functional_conv_transpose_with_kwargs
python test/test_quantization.py -k test_lowering_functional_linear_with_kwargs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95865
Approved by: https://github.com/jgong5, https://github.com/supriyar
I am trying to use bfloat16 AMP on a range of devices, using the `enabled` argument to actually enable/disable AMP, like this:
```python
with torch.cuda.amp.autocast(enabled=use_amp, dtype=torch.bfloat16):
```
However, this raises a RuntimeError even if enabled=False.
```
File "/venv/lib/python3.8/site-packages/torch/amp/autocast_mode.py", line 221, in __init__
raise RuntimeError('Current CUDA Device does not support bfloat16. Please switch dtype to float16.')
RuntimeError: Current CUDA Device does not support bfloat16. Please switch dtype to float16.
```
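With the fix, a minimal sketch of the intended usage - the dtype capability check should no longer fire when autocast is disabled (the shapes here are arbitrary and only illustrate the pattern):
```python
import torch

use_amp = False  # AMP disabled, e.g. on hardware without bfloat16 support
with torch.cuda.amp.autocast(enabled=use_amp, dtype=torch.bfloat16):
    out = torch.randn(2, 2, device="cuda") @ torch.randn(2, 2, device="cuda")
```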
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96097
Approved by: https://github.com/ngimel, https://github.com/kit1980
Summary:
When creating a new DDP instance for the same model when an old DDP instance existed, the autograd hooks from the old DDP instance might not be cleared. Also, relying on python gc to clear out old autograd hooks is fragile and may not work 100% of the time.
As a result, in this PR I'm adding a way to explicitly remove these hooks from DDP
Test Plan:
Unit test added
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96490
Approved by: https://github.com/zhaojuanmao, https://github.com/rohan-varma
About that line:
```
torch.empty(3).random_(2)
```
* Since BCE supports targets in the interval [0, 1], a better example is to sample from uniform(0, 1), using `rand`
* BCE supports multiple dimensions, and the example in `F.binary_cross_entropy` highlights it
* `rand` is more well known than `random_`, which is a bit obscure (`rand` is in the [Random Sampling section in the docs](https://pytorch.org/docs/stable/torch.html#random-sampling))
* Chaining `empty` and `random_` gives binary values as floats, which is a weird way to get that result
* Why do it in two steps when we have sampling functions that do it in a single step?
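For comparison with the points above, a minimal sketch of the two approaches (the shape is chosen only for illustration):
```python
import torch

target_old = torch.empty(3).random_(2)  # binary 0./1. values as floats, in two steps
target_new = torch.rand(3)              # uniform(0, 1) samples in a single step
```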
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95178
Approved by: https://github.com/albanD, https://github.com/kit1980
1. Add AMP support for custom backends.
2. Optimize the file `backend_registration.py` and rename it to `custom_backend_registration.py`, so that other functions for custom backends can be registered there later.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96188
Approved by: https://github.com/bdhirsh
Summary:
## Motivation
The initial version of the CUPTI range profiler was conservative in turning off all other event types in the kineto/pytorch profiler.
However, there is value in enabling CPU-side activity logging. This lets us correlate CPU operators with GPU kernel statistics and helps us analyze flops and other performance metrics at the operator level.
## Details
1. Update the pytorch profiler experimental config parsing to allow setting CPU activities along with the range profiler. Only enabled in per-kernel measurement mode.
2. Fixed Clang-tidy issues (added nolint for 2 of them)
Test Plan: see bottom diff
Differential Revision: D44165079
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97048
Approved by: https://github.com/aaronenyeshi
Op-benchmark directly uses fx.Graph to create nodes without dynamo and then compiles the graph with inductor. Currently, operators with multiple outputs, e.g. native_layer_norm, fail to run with the standalone torch._inductor.compile() API (#95594). The graph's result is a single node with several outputs rather than a tuple of several nodes, but the standalone API forces a non-tuple result to be a tuple, i.e., a tuple with one node-type element that has several outputs. This PR treats a return node with several outputs as a tuple to avoid errors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96936
Approved by: https://github.com/jgong5, https://github.com/jansel
Fixes https://github.com/pytorch/pytorch/pull/95676#issuecomment-1460588229
PS: The exported ONNX proto doesn't seem to have type information now. I wonder if there was an ONNX pass doing this for us (converting torch dtype to onnx dtype during export).
A type promotion error would be raised if we tried to set the type:
```python
onnxscript_value.dtype = expected_value.dtype
```
onnx.onnx_cpp2py_export.shape_inference.InferenceError: [ShapeInferenceError] Shape inference error(s): (op_type:aten_add, node name: aten_add_1): [ShapeInferenceError] (op_type:Add, node name: n3): B has inconsistent type tensor(int64)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96349
Approved by: https://github.com/justinchuby, https://github.com/wschin
This reverts commit 34256bc73080d7898138c821273b9f31fab777f8.
@kit1980: I'm not sure how best to revert a co-dev PR like https://github.com/pytorch/pytorch/pull/96410#issuecomment-1474704337. IIRC, Ivan and Eli did a revert PR like this before, so I create one here just in case we need to use it. If that's the case, please feel free to get this merge to fix trunk. Otherwise, this can be closed.
@shunting314 If you can do a forward fix faster than this, please help do so.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97075
Approved by: https://github.com/kit1980
Summary:
Adds NNC-like logging that is configured through the env var `TORCH_LOGS`
Examples:
`TORCH_LOGS="dynamo,guards" python script.py` - prints dynamo logs at level INFO with guards of all functions that are compiled
`TORCH_LOGS="+dynamo,guards,graph" python script.py` - prints dynamo logs at level DEBUG with guards and graphs (in tabular) format of all graphs that are compiled
[More examples with full output](https://gist.github.com/mlazos/b17f474457308ce15e88c91721ac1cce)
Implementation:
The implementation parses the log settings from the environment, finds any components (aot, dynamo, inductor) or other loggable objects (guards, graph, etc.) and generates a log_state object. This object contains all of the enabled artifacts, and a qualified log name -> level mapping. _init_logs then adds handlers to the highest level logs (the registered logs), and sets any artifact loggers to level DEBUG if the artifact is enabled.
Note: set_logs is an alternative for manipulating the log_state, but if the environment contains TORCH_LOGS, the environment settings will be prioritized.
Adding a new log:
To add a new log, a dev should add their log name to torch._logging._registrations (there are examples there already).
Adding a new artifact:
To add a new artifact, a dev should add their artifact name to torch._logging._registrations as well.
Additionally, wherever the artifact is logged, `torch._logging.getArtifactLogger(__name__, <artifact_name>)` should be used instead of the standard logging implementation.
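A minimal sketch of what a call site might look like, using the artifact logger API named above ("graph" is one of the artifact names from the examples; the message itself is illustrative):
```python
import torch._logging

# obtain an artifact logger instead of a plain logging.getLogger(__name__)
graph_log = torch._logging.getArtifactLogger(__name__, "graph")
graph_log.debug("captured graph:\n%s", "<tabular graph dump>")
```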
[design doc](https://docs.google.com/document/d/1ZRfTWKa8eaPq1AxaiHrq4ASTPouzzlPiuquSBEJYwS8/edit#)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94858
Approved by: https://github.com/ezyang
Fixes #44189
Adds a new parameter, zero_grad_unused, to the torch.autograd.grad() function. This parameter allows the gradient to be set to 0 instead of None when a variable is unused, which can be helpful for higher-order partial derivatives.
Here is an example of using this new parameter to solve d^3y/dx^3 given y = a * x:
```python
x = torch.tensor(0.5, dtype=torch.float32, requires_grad=True)
a = torch.tensor(1, dtype=torch.float32, requires_grad=True)
y = x * a
dydx = torch.autograd.grad(y, x, create_graph=True, allow_unused=True)
d2ydx2 = torch.autograd.grad(dydx, x, allow_unused=True, zero_grad_unused=True)
try:
d3ydx3 = torch.autograd.grad(d2ydx2, x, allow_unused=True, zero_grad_unused=True)
except RuntimeError as e:
assert False, "Should not raise error"
```
With `zero_grad_unused`, d2ydx2 could be 0 instead of None, enabling d3ydx3 to be calculated as defined in math without throwing an error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97015
Approved by: https://github.com/soulitzer
This PR implements the support to benchmark max-autotune choices in subprocesses. This way crash like https://github.com/openai/triton/issues/1298 will only abort the autotuning child process but the parent process can continue.
There are a few things to note:
- cuda runtime does not work with fork, so we have to use spawn to create child processes. Check the best practices in the pytorch multiprocessing notes: https://pytorch.org/docs/stable/notes/multiprocessing.html
- to run a job in a child process, the multiprocessing module needs to pickle both the target function and arguments and pass them to child process. This is the major complexity of this prototype since there are quite a lot of corner cases making pickle fail.
Here I list the pickle related issues I encountered:
- Pickling a StorageBox causes infinite recursion. Error: https://gist.github.com/171e5ab404b7855dee2dfa1d9f093442 . Worked around by pickling the inner buffer.
- IRNode stores fx.Node's in its origin fields. However, we cannot pickle an fx.Node; it fails with the following error when pickling fx.Node.graph: https://gist.github.com/9c289e895d7091d7ec787c67bc3c0d70. Worked around by skipping origins when pickling an IRNode.
- A jinja Template in TritonTemplateKernel cannot be pickled: `TypeError: Template.__new__() missing 1 required positional argument: 'source' `. Worked around by pickling the source rather than the jinja Template; during unpickling, rebuild the jinja template.
- Due to how select_algorithm.template_kernels is populated, it is empty in the child process. Worked around by passing select_algorithm.template_kernels from the parent process to the child process directly.
- There is some change in TritonTemplate.generate to make a TritonTemplateKernel pickle'able. A TritonTemplate is referred to in the closure for a TritonTemplateKernel object.
- We cannot pass the choice to the child process directly because pickling fails for the lambdas/local functions being used. However, cloudpickle can handle lambdas. Worked around by passing the cloudpickle'd choice object to the child process; the child process needs to unpickle it explicitly.
Test:
```
python test/inductor/test_max_autotune.py -k test_max_autotune_mm_plus_mm
```
This is basically the repro I get from Bert Maher.
Benchmarking in a subprocess is about 4x slower than benchmarking in the same process. Without doing any profiling, I suspect the time is mostly spent starting a new process and doing initialization. Some ~thread~ process pool may help.
```
AUTOTUNE ref_mm_plus_mm(2048x64, 64x1536, 2048x64, 64x1536)
triton_mm_plus_mm_0 0.0276s 100.0%
triton_mm_plus_mm_6 0.0287s 96.4%
triton_mm_plus_mm_5 0.0317s 87.1%
triton_mm_plus_mm_1 0.0328s 84.4%
ref_mm_plus_mm 0.0379s 73.0%
triton_mm_plus_mm_7 0.0379s 73.0%
triton_mm_plus_mm_2 0.0399s 69.2%
triton_mm_plus_mm_3 0.0410s 67.5%
triton_mm_plus_mm_4 0.0410s 67.5%
AUTOTUNE takes 12.001659393310547 seconds
AUTOTUNE ref_mm_plus_mm(2048x64, 64x1536, 2048x64, 64x1536)
triton_mm_plus_mm_0 0.0276s 100.0%
triton_mm_plus_mm_6 0.0287s 96.4%
triton_mm_plus_mm_1 0.0317s 87.1%
triton_mm_plus_mm_5 0.0317s 87.1%
ref_mm_plus_mm 0.0379s 73.0%
triton_mm_plus_mm_7 0.0389s 71.1%
triton_mm_plus_mm_2 0.0399s 69.2%
triton_mm_plus_mm_3 0.0410s 67.5%
triton_mm_plus_mm_4 0.0410s 67.5%
AUTOTUNE takes 51.39659810066223 seconds
```
The feature is disabled by default and can be enabled by setting the following config or envvar:
```
autotune_in_subproc = os.environ.get("TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC") == "1"
```
Differential Revision: [D43996048](https://our.internmc.facebook.com/intern/diff/D43996048)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96410
Approved by: https://github.com/jansel
Previously this would clone triton and then try to check out without being in the git repo directory. This wasn't usually a problem because the environment already had a triton repo downloaded, but I ran into this while trying to construct a new environment.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96623
Approved by: https://github.com/anijain2305
# Summary
There exists an optimization within the scaled_dot_product_efficient backward attention path to, under the right conditions, output grad_q, grad_k, grad_v all as aliases of the same storage. This was done to optimize for the hot path where MHA does packed linear_projection -> chunk -> (view stuff) -> sdpa. The thought was that chunk's backward would be able to "trivially" cat the inputs. However, upon closer inspection, chunk.backward will call `cat` regardless of the inputs, so this optimization is not being utilized.
I validated this by profiling on main and then on this branch; the traces were the same, both with `split.backward()` calling into cat.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96880
Approved by: https://github.com/cpuhrsch
Summary:
Verified the changes to catch unspecialized ints/floats being added as additional graphargs in D44037548, prior to PR https://github.com/pytorch/pytorch/pull/95621.
However, with #95621 the issue to be solved originally is no longer valid because ints & floats in `forward` will always be specialized in export. This PR adds the assertion anyway *(though it will not be hit unless there is a regression)* to immediately catch any attempt to add an unspecialized int/float to the additional graphargs.
Test Plan:
Example of the error message would look like:
```
Dynamo attempts to add additional input: value=9.999999747378752e-06, source=NNModuleSource(inner=AttrSource(base=NNModuleSource(inner=AttrSource(base=LocalInputSource(local_name='self', pos=0), member='torch_module')), member='eps'))
```
Passed all export tests
```
Buck UI: https://www.internalfb.com/buck2/fea72653-5549-47e7-a9bf-740eb86a8e26
Test UI: https://www.internalfb.com/intern/testinfra/testrun/8725724422167257
RE: reSessionID-7b3470b1-c293-4c4a-9671-dd0b7a2839b8 Up: 6.0 KiB Down: 0 B
Jobs completed: 101. Time elapsed: 115.7s.
Tests finished: Pass 98. Fail 0. Fatal 0. Skip 0. 0 builds failed
```
Differential Revision: D44075910
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96786
Approved by: https://github.com/tugsbayasgalan, https://github.com/ezyang
Fixes #94841
This fixes the error messages in the following files, the same as those referenced in the linked issue. I was not able to find any additional examples, but am happy to add commits for any that I may have missed!
```
aten/src/ATen/native/Blas.cpp: "size mismatch, got ", self.size(0), ", ", mat.size(0), "x", mat.size(1), ",", vec.size(0));
torch/_decomp/decompositions.py: lambda: f"size mismatch, got {self.size(0)}x{self.size(1)},{vec.size(0)}",
```
Example output for `Blas.cpp` before:
```
size mismatch, got 3, 3x4,1
```
The new error messages have the following format:
```
aten/src/ATen/native/Blas.cpp: "size mismatch, got bias (", self.size(0), "), matrix (", mat.size(0), "x", mat.size(1), "), vector (", vec.size(0), ")");
torch/_decomp/decompositions.py: lambda: f"size mismatch, got matrix ({self.size(0)}x{self.size(1)}), vector ({vec.size(0)})",
```
Example output for `Blas.cpp` after:
```
size mismatch, got bias (3), matrix (3x4), vector (1)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96863
Approved by: https://github.com/albanD
Summary:
* add human readable type and ivalue printout
* fix internal linter warnings
Test Plan:
error message now looks like e.g.
```
E0315 16:27:32.409082 422313 ExceptionTracer.cpp:222] exception stack complete
terminate called after throwing an instance of 'c10::Error'
what(): List[int] is not a subtype of List[int]; schema arg name: 'split_sizes', ivalue: [1, 1]
```
Differential Revision: D44112297
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96903
Approved by: https://github.com/davidberard98
Summary:
This PR fixes `_get_or_create_default_group()` of `DeviceMesh`. When `mesh` of the first created `DeviceMesh` is not `[0, 1, 2, ... WORLD_SIZE - 1]` and `is_initialized() == False`, it wrongly asserts. This PR fixes this issue by removing these assertions.
---
More specifically, `_get_or_create_default_group()` has 4 checks:
1. `DeviceMesh must include every process in WORLD`
2. `DeviceMesh cannot have duplicate values`
3. `DeviceMesh ranks must start from 0`
4. `DeviceMesh should have all ranks of WORLD`
1, 3, and 4 are not satisfied when `self.mesh` is not `[0, 1, 2, ... WORLD_SIZE - 1]`.
2 is a valid check, but it is also checked in `__init__()`, so we don't need to check it again in this function.
Test Plan: CI
Reviewed By: wanchaol
Differential Revision: D44098849
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96961
Approved by: https://github.com/wanchaol
Summary: Decoder native joins the dead code society
With the recent introduction of PT2, we no longer need native decoder operators:
1 - full-function SDPA kernels can be used to implement cross-attention efficiently without the (slower) decoder MHA blob.
2 - torch.compile() generates more efficient code across many platforms from the python implementation of decoders than from the decoder layer blob, by tailoring code to the target platform.
Test Plan: github & sandcastle
Differential Revision: D43811808
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96025
Approved by: https://github.com/ezyang, https://github.com/albanD
This method has to be accessible from `c10` to enable CUDA-12 integration.
Implemented by providing private `c10::cuda::_internal::setHasPrimaryContext` that passes the pointer to the implementation (in `torch_cuda`) back to c10.
Use global class constructor/destructor to guarantee RAII.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96800
Approved by: https://github.com/ngimel
Summary:
Same as D43747173 (https://github.com/pytorch/pytorch/pull/95911) except for the newly added x86 SSE2 kernels.
For future reference, wrappers can be generated by
```
cd ~/fbsource/xplat/third-party/XNNPACK
# Update the list of internal only kernels in generate-wrappers.py
python3 generate-wrappers.py
```
Test Plan: CI
Reviewed By: digantdesai
Differential Revision: D44072764
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96896
Approved by: https://github.com/digantdesai
CUDA Graph Trees
Design doc: https://docs.google.com/document/d/1ZrxLGWz7T45MSX6gPsL6Ln4t0eZCSfWewtJ_qLd_D0E/edit
Not currently implemented :
- Right now, we are using weak tensor refs from outputs to check if a tensor has died. This doesn't work because a) aliasing, and b) aot_autograd detaches tensors (see note [Detaching saved tensors in AOTAutograd]). Would need either https://github.com/pytorch/pytorch/issues/91395 to land to use storage weak refs, or manually add a deleter fn that does what I want. This is doable but there are some interactions with the caching allocator checkpointing, so saving for a stacked PR.
- Reclaiming memory from the inputs during model recording. This isn't terribly difficult but deferring to another PR. You would need to write over the input memory during warmup, and therefore copy the inputs to cpu. Saving for a stacked pr.
- Warning on overwriting previous generation outputs, and handling nested torch.compile() calls in generation tracking.
Differential Revision: [D43999887](https://our.internmc.facebook.com/intern/diff/D43999887)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89146
Approved by: https://github.com/ezyang
Summary: Today if we're accessing out-of-bounds embedding rows, it'll either go through or throw an IMA (illegal memory access). This is not ideal - adding bounds checks. This will probably slow things down - need to benchmark it.
Test Plan:
TODO: add some tests
Tried a simple example and it's showing this:
```
aten/src/ATen/native/cuda/EmbeddingBag.cu:143: EmbeddingBag_updateOutputKernel_sum_mean: block: [0,0,0], thread: [0,1,0] Assertion `input[emb] < numRows` failed.
```
Differential Revision: D43810777
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96022
Approved by: https://github.com/cpuhrsch, https://github.com/ngimel
- Only ReflectPad needs the torch checks for input arguments and not the ReplicatePad
- Added a test case
- The failure was originally found in test_modules with test `test_forward_nn_ReplicationPad3d_mps_float32`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96988
Approved by: https://github.com/DenisVieriu97
Previously, when starting to trace a function, we would record a frame summary recording the definition loc. This would lead to an unconventional-looking stack trace when used for debugging, e.g., shape guards.
```
File ".../scripts/avik/pt2/example.py", line 407, in forward
def forward(self, x):
...
File ".../transformers/models/bert/modeling_bert.py", line 912, in forward
@add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
...
File ".../transformers/models/bert/modeling_bert.py", line 562, in forward
def forward(
...
File ".../transformers/models/bert/modeling_bert.py", line 484, in forward
def forward(
...
File ".../transformers/models/bert/modeling_bert.py", line 416, in forward
def forward(
...
File ".../transformers/models/bert/modeling_bert.py", line 275, in forward
def forward(
...
File ".../transformers/models/bert/modeling_bert.py", line 351, in forward
attention_scores = attention_scores + attention_mask
```
As noted in https://github.com/pytorch/pytorch/pull/95848#discussion_r1134397096, we would like to change this to record function calls instead, like conventional stack traces do. This diff makes this change. The above stack now looks like the following, which is way more helpful at a glance to understand what's going on.
```
File ".../scripts/avik/pt2/example.py", line 408, in forward
bert_out = self.bert(**x)
...
File ".../transformers/models/bert/modeling_bert.py", line 1021, in forward
encoder_outputs = self.encoder(
...
File ".../transformers/models/bert/modeling_bert.py", line 610, in forward
layer_outputs = layer_module(
...
File ".../transformers/models/bert/modeling_bert.py", line 496, in forward
self_attention_outputs = self.attention(
...
File ".../transformers/models/bert/modeling_bert.py", line 426, in forward
self_outputs = self.self(
...
File ".../transformers/models/bert/modeling_bert.py", line 351, in forward
attention_scores = attention_scores + attention_mask
```
Differential Revision: [D44101882](https://our.internmc.facebook.com/intern/diff/D44101882/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96882
Approved by: https://github.com/ezyang
Summary:
Add vmap support for torch.tril and torch.triu.
Fix: #91403
Test Plan: GitHub pipeline
Differential Revision: D43016624
### Expected behavior
Same as using for-loop:
```python
import torch
x = torch.randn(32, 3)
results = []
for xi in x:
y = torch.triu(xi)
results.append(y)
"""
triu: input tensor must have at least 2 dimensions
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-7-d726203efb0e> in <module>
4 results = []
5 for xi in x:
----> 6 y = torch.triu(xi)
7 results.append(y)
RuntimeError: triu: input tensor must have at least 2 dimensions
"""
```
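With the batching rule in place, a hedged sketch of what the vmap'd call looks like (the 3-D shape here is illustrative; each per-sample slice must still be at least 2-D):
```python
import torch

x3 = torch.randn(32, 3, 3)
# equivalent to torch.stack([torch.triu(x3[i]) for i in range(32)])
out = torch.vmap(torch.triu)(x3)
```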
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94287
Approved by: https://github.com/Skylion007, https://github.com/zou3519
As in the title.
The `masked_grad` kw argument is required for `to_dense` backward to distinguish the expected semantics of sparse tensors. `masked_grad=True` means that the `to_dense` backward will apply a mask to the returned gradient where the mask is defined by the input indices. The default semantics implies `masked_grad==True` for BC but see the [comment](https://github.com/pytorch/pytorch/pull/96095/files#diff-d4df180433a09071e891d552426911c227b30ae9b8a8e56da31046e7ecb1afbeR501-R513) in `to_dense_backward`.
As a consequence, existing code that is run through autograd engine must replace `.to_dense()` calls with `.to_dense(masked_grad=False)`. For example,
```python
torch.autograd.gradcheck(lambda x: torch.sum(x, [0]).to_dense())
torch.autograd.gradcheck(lambda x: torch.sparse.sum(x, [0]).to_dense())
```
(recall, gradcheck has `masked=False` as default) must be updated to
```python
torch.autograd.gradcheck(lambda x: torch.sum(x, [0]).to_dense(masked_grad=False))
torch.autograd.gradcheck(lambda x: torch.sparse.sum(x, [0]).to_dense(masked_grad=True), masked=True)
```
Fixes https://github.com/pytorch/pytorch/issues/95550
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96095
Approved by: https://github.com/cpuhrsch
I experimented with 200 `nn.Linear`s with `bias=True` for a total of 400 `nn.Parameter`s all wrapped into the same FSDP instance and world size of 2.
**`unshard()` -> `_use_unsharded_views()`**
- (From previous PR) unsafe `setattr`: 6.112 ms -> 4.268 ms
**`pre_unshard()` -> `_writeback_orig_params()`**
- Factor out `flat_param` and `flat_param_grad` data pointers: ~1.8 ms -> 1.071 ms
- Now dominated by calling `_typed_storage()` on each original parameter and its gradient
**`reshard()` -> `_use_sharded_views()`**
- Factor out `torch.empty(0, ...)`: ~4.6 - 4.7 ms -> ~2.7 - 2.8 ms
- Now dominated by `aten::slice()` and (unsafe) `setattr`, which are required
I removed some `assert` calls that were only needed for mypy or if the subsequent call would provide the same error message anyway. These have negligible overhead, but I think it is still okay to remove them and avoid the type check. We need to address type checking more holistically anyway.
---
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96958
Approved by: https://github.com/rohan-varma
Summary: This Diff fixes some DeviceMesh issues which block internal DTensor integration. Specifically, when `self.mesh = [2, 3]` while `world_size = 4`, because `unique_mesh_values[-1] == 3`, it takes the first short-cut branch and uses `default_pg`. Let's check the length instead of the last value of `unique_mesh_values`.
Test Plan: CI
Reviewed By: wanchaol
Differential Revision: D44079872
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96861
Approved by: https://github.com/wanchaol
Also updating merge_rule to allow the ONNX exporter team to update the Docker script by themselves. By default, the model is cached at ~/.cache/huggingface/hub/ under the CI jenkins user.
The model is cached so that we don't need to re-download it every time in CI, which causes flaky [CI failures](https://hud.pytorch.org/failure/FAILED%20test%2Fonnx%2Ftest_fx_to_onnx_with_onnxruntime.py%3A%3ATestFxToOnnxWithOnnxRuntime%3A%3Atest_large_scale_exporter_with_tiny_gpt2%20-%20requests.exceptions.ReadTimeout%3A%20HTTPSConnectionPool(host%3D'huggingface.co'%2C%20port%3D443)%3A%20Read%20timed%20out.%20(read%20timeout%3D10.0)).
This is the second part after https://github.com/pytorch/pytorch/pull/96590
### Testing
Confirm that the model is cached in the Docker image before running the test:
```
jenkins@dd0db85dd34f:~/workspace$ ls -la ~/.cache/huggingface/hub/models--sshleifer--tiny-gpt2/*
/var/lib/jenkins/.cache/huggingface/hub/models--sshleifer--tiny-gpt2/blobs:
total 2460
drwxrwxr-x 2 jenkins jenkins 126 Mar 15 05:48 .
drwxrwxr-x 5 jenkins jenkins 48 Mar 15 05:48 ..
-rw-rw-r-- 1 jenkins jenkins 662 Mar 15 05:48 2c81a6c4c984e95a45338c64a7445c1f0f88077f
-rw-rw-r-- 1 jenkins jenkins 2514146 Mar 15 05:48 b706b24034032bdfe765ded5ab6403d201d295a995b790cb24c74becca5c04e6
/var/lib/jenkins/.cache/huggingface/hub/models--sshleifer--tiny-gpt2/refs:
total 4
drwxrwxr-x 2 jenkins jenkins 18 Mar 15 05:48 .
drwxrwxr-x 5 jenkins jenkins 48 Mar 15 05:48 ..
-rw-rw-r-- 1 jenkins jenkins 40 Mar 15 05:48 main
/var/lib/jenkins/.cache/huggingface/hub/models--sshleifer--tiny-gpt2/snapshots:
total 0
drwxrwxr-x 3 jenkins jenkins 54 Mar 15 05:48 .
drwxrwxr-x 5 jenkins jenkins 48 Mar 15 05:48 ..
drwxrwxr-x 2 jenkins jenkins 50 Mar 15 05:48 5f91d94bd9cd7190a9f3216ff93cd1dd95f2c7be
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96793
Approved by: https://github.com/titaiwangms, https://github.com/ZainRizvi
This PR addresses the issues opened in #25155. However, those specific tests are no longer used, since after #37473 they were moved to test_torchbind.
This PR enables TestTorchbind on Windows.
test_custom_class.py is no longer used after that commit.
In the original issue, there were problems on Windows with those tests so I tested the updated ones to see if they work.
I had no issues with them so this enables them on Windows.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96507
Approved by: https://github.com/ezyang
# Motivate
Add XPU device type to CppFunction dispatch overload function.
We previously omitted it.
# Solution
Add XPU device type.
# Additional
This list is synchronized with the k-constants in c10/core/DeviceType.h
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96849
Approved by: https://github.com/ezyang
Enables the last few files under pytest.
xdist was causing problems with `profiler/test_profiler` `test_source_multithreaded` due to creating extra threads. Luckily we don't use it so we can disable it with `-p no:xdist`, but this is incompatible with pytest-rerunfailures==10.2, so upgrade to 10.3. I'd update the windows ami but idk how.
`dynamo/test_optimizers` and `dynamo/test_repros` both had tests that used skip_if_pytest. https://github.com/pytorch/pytorch/pull/93251/files suggests that it is due to pytest assertion rewriting, so I added `PYTEST_DONT_REWRITE` to their module docstrings to prevent pytest from rewriting assertions.
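For reference, a minimal sketch of the docstring marker that pytest honors (a hypothetical test module, not a file from this diff):
```python
"""Hypothetical dynamo test module.

PYTEST_DONT_REWRITE
"""


def test_example():
    # assertions in this module are not rewritten by pytest
    assert 1 + 1 == 2
```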
Disable test by issue in `dynamo/test_dynamic_shapes` seems sane.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96698
Approved by: https://github.com/huydhn, https://github.com/malfet
Adding an environment variable `TORCH_LINEAR_FLATTEN_3D` to force flattening of 3D input tensor even when it is non-contiguous.
Today, the `Linear` op would flatten a 3D input tensor if it is contiguous.
It was found that even for some non-contiguous inputs (esp. with BF16 data type), flattening would also yield higher performance.
For example:
```
import torch

x_size = (3072, 1196, 128)
x = torch.rand(x_size, device="cuda", dtype=torch.bfloat16)
x = torch.transpose(x, 1, 2)  # non-contiguous, shape (3072, 128, 1196)
weight = torch.rand(256, 1196, device="cuda", dtype=torch.bfloat16)  # out_features of 256 is illustrative
bias = torch.rand(256, device="cuda", dtype=torch.bfloat16)
torch._C._nn.linear(x, weight, bias)
```
Since the detailed auto-tuning is unknown, this PR adds an environment variable for users to make a choice.
(Default value is still 0.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96761
Approved by: https://github.com/ngimel
# Summary
This PR made some significant changes to the scripts around Release Scripts. At a high level:
- Turned the quips into docs and updated links
- Update the common.categorizes list in the hope of making this the source of truth for releases. This is hard since the release_notes labels can be changed at will. An alternative would be to poll the github api, but I think that is overkill. The notebook does a set compare and will show you new categories. I think we want this to be manual so that the release notes engineer decides how to categorize.
- Create category groups from speaking with folks on distributed and AO who told me these different release categories can be merged.
- I am the newest person on Core and don't use ghstack, so I made the token getting a little more generic.
- Added a classifier.py file. This file will train a commit categorizer for you, hopefully with decent accuracy. I was able to achieve 75% accuracy. I drop the highest-frequency class, "skip", since this creates a more useful categorizer.
- I updated the categorize.py script so that the prompt will be what the classifier thinks, gated by a flag.
- Added a readme that will hopefully help future release notes engineers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94560
Approved by: https://github.com/albanD
**note about second try**
First try (https://github.com/pytorch/pytorch/pull/96780) was reverted because while it fixed periodic,
it broke inductor cpu-accuracy (which strangely didn't show up as failures on this PR). This try keeps the cpu-accuracy filter and also adds the inductor filter to get rid of periodic jobs.
**the actual PR desc**
It's going to be harder to properly support check_graph_breaks across multiple baselines.
Periodic and Inductor workflows are different baselines since they include different sets of models.
It's not as simple as checking in the csv for the superset (periodic), because update_expected.py is designed to run given the sha of your failing PR and reset the baseline to that PR's artifacts. This is a nice workflow, and would be harder to manage if it had to always point to a periodic job.
For now just do the check on the inductor job and ignore the other models covered only on periodic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96803
Approved by: https://github.com/desertfire
Fix https://github.com/pytorch/pytorch/issues/96042
### before
```
>>> torch.aminmax(torch.tensor(1, device='cpu'), dim=0, keepdim=True)
__main__:1: UserWarning: An output with one or more elements was resized since it had shape [], which does not match the required output shape [1]. This behavior is deprecated, and in a future PyTorch release outputs will not be resized unless they have zero elements. You can explicitly reuse an out tensor t by resizing it, inplace, to zero elements with t.resize_(0). (Triggered internally at ../aten/src/ATen/native/Resize.cpp:24.)
torch.return_types.aminmax(
min=tensor([1]),
max=tensor([1]))
>>> torch.aminmax(torch.tensor(1, device='cpu'), dim=0, keepdim=False)
torch.return_types.aminmax(
min=tensor(1),
max=tensor(1))
```
### after
```
>>> torch.aminmax(torch.tensor(1, device='cpu'), dim=0, keepdim=True)
torch.return_types.aminmax(
min=tensor(1),
max=tensor(1))
>>> torch.aminmax(torch.tensor(1, device='cpu'), dim=0, keepdim=False)
torch.return_types.aminmax(
min=tensor(1),
max=tensor(1))
```
Marked the following test as expected_fail:
`test_vmap.py TestVmapOperatorsOpInfoCPU.test_op_has_batch_rule_aminmax_cpu_float32`
Given an input shape of (2,), the loop output has shape (2,) while the batched vmap output has shape (2, 1), which mismatch.
The loop runs twice on a tensor of shape (): without this patch, each output has shape (1,) and then stacks into (2, 1); with this patch, each output has shape () and stacks into (2,).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96171
Approved by: https://github.com/jgong5, https://github.com/ngimel, https://github.com/zou3519
Summary: Adding exception handler to a few more APIs so that internal errors are logged to the c10d errors scuba table
Test Plan: sandcastle
Differential Revision: D44068557
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96770
Approved by: https://github.com/wz337
Fixes #96429
This PR is also a follow-up to #90427. In that PR, we discussed whether the grid-index calculations in `grid_sampler_compute_source_index` should also be upcast to `opmath_t` https://github.com/pytorch/pytorch/pull/90427/files#r1048876708. Due to another unit test failure, we didn't upcast those calculations in that PR.
After some investigations, I found that the inaccurate results have nothing to do with the internals of `affine_grid`, even if it's calculated using `double` internally. As long as the input `grid` is passed to `grid_sample` in **half** precision, the results will be less inaccurate than with a **float** `grid`. This can be verified with a short C++ program like this (by setting `TYPE_T` to `__half` and `float` in compilations)
```cpp
#include <cuda.h>
#include <cuda_runtime.h>
#include <cuda_fp16.h>
#include <iostream>
#ifndef TYPE_T
#define TYPE_T float
#endif
int main() {
using type_t = TYPE_T;
type_t d = static_cast<__half>((double)2.0 / 3.0);
type_t s = (((float)d + 1.f) * 3 - 1) / 2;
printf("%.15f %.15f\n", (double)d, (double)s);
}
```
Outputs are
```
./float.out
0.666503906250000 1.999755859375000
./half.out
0.666503906250000 2.000000000000000
```
To resolve the discussion back in https://github.com/pytorch/pytorch/pull/90427/files#r1048876708, I've also increased the test tolerance in the failed unit test `issue_24823_1(torch.half)`.
For the original script in #96429, I got more accurate results with `align_corners = True`
```
align_corners = True
Expected result has mean absolute value of 0.5285 and maximum absolute value of 3.2067.
Half precision result is off by 0.0001 (0.02%) on average and 0.0010 (0.03%) at maximum.
align_corners = False
Expected result has mean absolute value of 0.5189 and maximum absolute value of 3.0101.
Half precision result is off by 0.0001 (0.02%) on average and 0.0010 (0.03%) at maximum.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96586
Approved by: https://github.com/ngimel
Fix for https://github.com/pytorch/pytorch/issues/95693.
From https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html:
> There are minor difference between the two APIs to and contiguous. We suggest to stick with to when explicitly converting memory format of tensor.
For general cases the two APIs behave the same. However in special cases for a 4D tensor with size NCHW when either: C==1 or H==1 && W==1, only to would generate a proper stride to represent channels last memory format.
We hit this case in convolution_backward in calling `contiguous()`. Even though we were determining that we should run the backward in channels_last forward, as FakeTensor had gathered from the output of [determine_backend_memory_format](https://github.com/pytorch/pytorch/blob/master/torch/_subclasses/fake_tensor.py#L559), we were still outputting a contiguous tensor. That led to the mismatch in strides in the issue.
Should we be calling `to` instead of `contiguous` more liberally throughout the codebase, especially in convolution related code ? Not sure if there are reasons not to do this.
Another fix would be to update `cudnn_conv_suggest_memory_format` so that it would output a contiguous_format in this case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96791
Approved by: https://github.com/ngimel
When constructing the joint graph, we normally have to clone any inputs that are mutated, so that we can pass in the original, pre-mutation inputs as leaves to autograd.
Previously, we were doing this for all mutated inputs - but we only need to do it for inputs that require gradients and participate in autograd.
Hopefully this should speed up code like batch norm - I think before this we were unnecessarily cloning the running stats during training.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96342
Approved by: https://github.com/albanD, https://github.com/ezyang
This refactor should make it easier to add an export hook into aot autograd.
(1) I killed `create_forward_or_joint_functionalized()` (and the functions that it called, like `forward_or_joint()`) which used to handle autograd + functionalization all-in-one-go for the joint case, and was also used in the inference case.
I added a few separate helper functions:
`create_functionalized_graph()`: this takes a flat fn, and returns a functionalized fx graph. It is mostly just a thin wrapper around functionalization + make_fx(), but also has some extra logic to manually append `copy_()` ops to the end of the graph.
`fn_no_extra_mutations()`: this creates the fn that we want to trace in the inference code path. It takes in a function that it then calls, and returns the outputs + any (updated) mutated inputs.
`joint_fn_no_external_mutations()`: this creates the fn that we want to trace in the joint code path. It takes in a function, and traces out its joint. It also does the work of cloning inputs that are mutated and require gradients, returning mutated inputs as outputs, and returning intermediate bases as outputs
We should be able to add an export hook by basically adding a similar version of `joint_fn_no_external_mutations` but with a lot more restrictions (guaranteed to have no tangents, not synthetic bases, etc), and calling `create_functionalized_graph()` on it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96341
Approved by: https://github.com/ezyang
Another bonus of factoring the synthetic_base logic into one place: we used to have a `CompiledRuntimeMetadata` object that encapsulated `ViewAndMutationMeta`, plus a bunch of extra synthetic base metadata that was plumbed around. Now I can kill that first metadata object, and use `ViewAndMutationMeta` on its own everywhere.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96340
Approved by: https://github.com/ezyang
Ed pointed it out a few days ago - I probably added this mistakenly a few months ago. I can't think of any reason it's necessary, and removing it doesn't cause any tests to fail.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96339
Approved by: https://github.com/ezyang
This refactor doesn't significantly change LoC in aot autograd, but I think this nets out to making it clearer (interested in peoples' thoughts).
The idea is that I tried to re-write the part of aot autograd that deals with synthetic bases in a layered way, similar to how Ed wrote the logic for dedup'ing inputs: it happens in one place, and all of the downstream transformation in aot autograd don't have to worry about it.
Specifically, I added a new function `aot_wrapper_synthetic_base`, similar to the existing `aot_wrapper_dedupe`.
The benefit: none of the other code in aot autograd needs to think about synthetic bases (previously, synthetic base code was intertwined in several places).
The downsides: there are two.
(1) `aot_wrapper_synthetic_base()` needs to have its own epilogue. There is one particularly hairy case, where factoring the synthetic base logic to a single location was painful: If you have two inputs that alias each other, where one gets a data mutation, and the other gets a metadata mutation.
Ordinarily, metadata mutations are handled by the runtime epilogue, in `create_runtime_wrapper`. However, now that things are factored this way, the runtime wrapper operates only on synthetic bases instead of operating on the original inputs. For data mutations, it is fine to apply the data mutation to the synthetic base instead of the original input alias. But for metadata mutations, we **need** to apply the metadata mutation directly to the original inputs.
The way that I handled this was by tracking which inputs slot into this specific case (part of a synthetic base, and get metadata mutations), and updating the flat_fn() that we pass downstream to return these updated inputs as extra outputs. From the perspective of downstream logic, these are real user outputs, that it can treat like any other user outputs. `aot_wrapper_synthetic_base` will know to grab these extra outputs and use them to apply the metadata mutations.
This was pretty annoying, but has the benefit that all of that logic is encapsulated entirely in `aot_wrapper_synthetic_base()`.
(2) input mutations are now performed on the synthetic base instead of the individual aliases.
You can see the original code comment [here](b0b5f3c6c6/torch/_functorch/aot_autograd.py (L1131)) for details. We used to do the optimized thing in this case, and now we do the less optimized thing (copying the entire synthetic base, instead of the potentially smaller alias).
To be fair, we had no data showing that this optimization was showing improvements on any models in practice. I also think that the main reason anyone would ever run across this problem is because of a graph break - so if you care about perf, you probably want to avoid the extra graph breaks to begin with. I haven't added any warnings for this, but we probably could depending on what people think.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96235
Approved by: https://github.com/ezyang
For a while now, we've been re-running our functionalization analysis pass twice - once for get metadata when dedup'ing, and an entire second time during aot_dispatch_base/autograd.
This should also probably speed up compile times pretty noticeably, since we're going from:
(a) inference-only trace case: 3 fw traces -> 2 fw traces
(b) autograd trace case: 2 fw traces + 1 joint trace -> 1 fw trace + 1 joint trace
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95992
Approved by: https://github.com/ezyang
A number of OSS PRs were reverted because of new signed-unsigned comparison warnings, which are treated as errors in some internal builds.
Not sure how those selective rules are applied, but this PR removes `-Wno-sign-compare` from the PyTorch codebase.
The only tricky part in this PR is making sure that non-ASCII character detection works for both signed and unsigned chars here:
6e3d51b08a/torch/csrc/jit/serialization/python_print.cpp (L926)
Exclude several files from sign-compare if flash attention is used, due to the violation in cutlass, to be fixed by https://github.com/NVIDIA/cutlass/pull/869
Do not try to fix sign compare violations in caffe2 codebase
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96723
Approved by: https://github.com/albanD
Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>
Fixes #91351
As for unit tests - in this PR I only fixed the LSTM unit test to properly use dynamic axes and expose the export issue by running the test with the same ONNX model on additional inputs.
If the changes are approved, we should also fix the rest of the tests (RNN/GRU and beyond).
I have verified the following updated tests are working with new code and failing with the old code:
test/onnx/test_pytorch_onnx_onnxruntime.py::TestONNXRuntime_opset_version_14_is_script_False_keep_initializers_as_inputs_True::test_rnn_name_lstm_nonlinearity_None_unilayer_bidirectional_no_initial_state_with_variable_length_sequences_with_dropout
test/onnx/test_pytorch_onnx_onnxruntime.py::TestONNXRuntime_opset_version_14_is_script_False_keep_initializers_as_inputs_True::test_rnn_name_lstm_nonlinearity_None_unilayer_bidirectional_with_initial_state_with_variable_length_sequences_with_dropout
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92970
Approved by: https://github.com/titaiwangms, https://github.com/kit1980
Chatted with @stas00 on slack and here are some great improvements he suggested to the compile docs
- [x] Rename `dynamo` folder to `compile`
- [x] Link `compile` docstring on `torch.html` to main index page for compile
- [x] Create a new index page that describes why people should care
- [x] easy perf, memory reduction, 1 line
- [x] Short benchmark table
- [x] How to guide
- [x] TOC that links to the more technical pages folks have written, make the existing docs we have a Technical overview
- [x] Highlight the new APIs for `torch._inductor.list_options()` and `torch._inductor.list_mode_options()` - clarify these are inductor specific and add more prose around which ones are most interesting
He also highlighted an interesting way to think about who is reading this doc we have
- [x] End users, that just want things to run fast
- [x] Library maintainers wrapping torch.compile which would care for example about understanding when in their code they should compile a model, which backends are supported
- [x] Debuggers who needs are somewhat addressed by the troubleshooting guide and faq but those could be dramatically reworked to say what we expect to break
And in a separate PR I'll work on the below with @SherlockNoMad
- [ ] Authors of new backends that care about how to plug into dynamo or inductor layer so need to explain some more internals like
- [ ] IR
- [ ] Where to plugin, dynamo? inductor? triton?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96706
Approved by: https://github.com/svekars
Previously the allocator would query whether a stream was recording a graph,
and look up the pool associated with a graph. This change has the allocator
directly associate a stream with a mempool, decoupling "record this stream to a pool"
from the action of "record all actions to a cuda graph".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96542
Approved by: https://github.com/eellison
It's going to be harder to properly support check_graph_breaks
across multiple baselines.
Periodic and Inductor workflows are different baselines since they include
different sets of models.
It's not as simple as checking in the csv for the superset (periodic),
because `update_expected.py` is designed to run given the sha of your
failing PR and reset the baseline to that PR's artifacts. This is a
nice workflow, and would be harder to manage if it had to always point to
a periodic job.
For now just do the check on the inductor job and ignore the other models
covered only on periodic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96780
Approved by: https://github.com/malfet, https://github.com/huydhn
Summary: LLVM trunk / llvm-16 removes the `PassManagerBuilder.h` file. But we are using the new pass manager for llvm>=15 anyway.
Test Plan: sandcastle
Differential Revision: D44064301
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96762
Approved by: https://github.com/bertmaher
This fixes
```
File "/data/users/ezyang/a/pytorch/torch/_inductor/codegen/triton.py", line 1642, in codegen_node_schedule
indexing_dtype_strength_reduction(node._body)
File "/data/users/ezyang/a/pytorch/torch/_inductor/optimize_indexing.py", line 310, in indexing_dtype_strength_reduction
OptimizeIndexing(loop_body, indices, indexing).run()
File "/data/users/ezyang/a/pytorch/torch/_inductor/optimize_indexing.py", line 96, in __init__
self.replace_indirect(k, ValueRanges(0, v))
File "/data/users/ezyang/a/pytorch/torch/utils/_sympy/value_ranges.py", line 67, in __init__
upper = simple_sympify(upper)
File "/data/users/ezyang/a/pytorch/torch/utils/_sympy/value_ranges.py", line 33, in simple_sympify
assert not e.free_symbols, f"free variables NYI: {e}"
AssertionError: free variables NYI: s0
```
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96771
Approved by: https://github.com/eellison
Merges startswith/endswith calls into a single call that feeds in a tuple. Not only are these calls more readable, but they are also more efficient, iterating through each string only once.
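For illustration, the pattern being consolidated (a generic sketch, not a literal call site from the diff):
```python
name = "torch.ops.aten.add"

# before: two scans of the string
if name.startswith("torch.ops") or name.startswith("torch._ops"):
    pass

# after: one call with a tuple of prefixes
if name.startswith(("torch.ops", "torch._ops")):
    pass
```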
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96754
Approved by: https://github.com/ezyang
Changes:
- bc-breaking change: The main difference between this and the old non-reentrant impl that it replaces is that we clear recomputed tensors on backward immediately upon unpack, even if retain_graph=True. This has the following additional implications:
- Accessing _saved_tensors multiple times will silently recompute forward multiple times.
- Accessing ctx.saved_tensors twice in the same backward will now raise an error.
- To avoid dealing with the potential consequences, early stopping has been hidden behind a global flag that is False by default and can be enabled via a context manager (a sketch follows this list). We can remove this in a follow-up. As a result, some features of nesting do not work by default.
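A minimal sketch of the opt-in described above, assuming the context manager is exposed as torch.utils.checkpoint.set_checkpoint_early_stop (treat that name as an assumption here, not a confirmed public API):
```python
import torch
from torch.utils.checkpoint import checkpoint, set_checkpoint_early_stop

x = torch.randn(4, requires_grad=True)
# early stopping is off by default; opt in for this region only
with set_checkpoint_early_stop(True):
    out = checkpoint(lambda t: t.sin().cos(), x, use_reentrant=False)
out.sum().backward()
```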
Before land:
- import to check for more bc-breakingness
- implement any workarounds for the bc-breaking-ness, if we decide on any
- update docs to reflect new lifetime of recomputed variables
- update docs to mention the early stop feature
Follow ups:
- enable early-stopping by default
- update docs/tutorial to feature nested use cases
Related docs:
- code comment: https://github.com/pytorch/pytorch/pull/90105/files#diff-9dcd955620b52ce128e18e3567be88edbb238810460d1288a86fabc20e483b30R448
- design doc: https://docs.google.com/document/d/1UDLhTNv6_kvuDTRlsjfj9WdqtNaQNr8ahrvdBIB6914/edit#
- retains_grad <> checkpoint https://docs.google.com/document/d/1maiGmuFUxysQL0AdYUU88kngAaXh_L0XpDcLDh_5Ors/edit
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90105
Approved by: https://github.com/albanD
To achieve this, I have a per-StorageImpl (was data_ptr in the previous version of this PR, but moved to StorageImpl to ensure stability of the key before/after sharing) lock created when we are about to share a storage and make sure that all other calls to share memory wait on this lock before moving forward.
This does NOT make this call generally thread safe as any call that is not sharing memory will race and lead to UB.
This ensures that the sample from @robertolat in https://github.com/pytorch/pytorch/issues/95606 works fine.
This does NOT fix the example from @imurray in that same issue, as the call still races with the `.sum()` call. This race is expected and there is no easy way for us to make it work, I'm afraid (see the issue for more details).
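A minimal sketch of the now-serialized pattern (concurrent share_memory_ calls on the same tensor); any overlapping non-sharing op, like the .sum() mentioned above, would still race:
```python
import threading
import torch

t = torch.ones(1024)
threads = [threading.Thread(target=t.share_memory_) for _ in range(4)]
for th in threads:
    th.start()
for th in threads:
    th.join()
assert t.is_shared()
```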
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96664
Approved by: https://github.com/colesbury
This PR does a few things all at once, as I needed to fix several bugs on the way here. The main goal of the PR is to fix the `'float' object has no attribute '_has_symbolic_sizes_strides'` error. The general idea is to heavily penalize non-SymInt but still SymNode cuts in the graph. This doesn't work for default partitioner, so essentially, dynamic shapes with default partitioner is not supported.
While doing this, I had to fix a few other bugs in the partitioner:
* SymNode operations weren't considered recomputable. But they are very cheap, go wild.
* zeros_like wasn't considered recomputable, and this prevented some gradient formulas (e.g., for angle with real inputs) from successfully finding a cut at all
* AOTAutograd tests use the default partitioner. I switched them to use the min-cut partitioner...
* ...but this reveals a bug where if we have nodes in backward outputs that don't depend on tangents, they never get assigned to the backward graph. I fix this by making it mandatory for backward outputs to be in the backward graph. I have to be careful to filter out None backward outputs; those never participate in flow analysis!
This causes some wobbling for the min-cut tests, but these seem legitimate: since we're now willing to recompute, the partitioner can reduce the number of SymInts it transmits by just doing some recompute in the backend.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96653
Approved by: https://github.com/ngimel
This refactors the stack trace facility specific to memory profiling
in python+cuda to make a generic facility to generate combined stack
traces.
The generic facility (combined_traceback.h) does not require
python to be around to work, but will return python stacks if it is
present.
This facility is then used to add support for stack trace gathering in memory profiling that
happens directly from C++.
It is also used to expose a python API for gathering and symbolizing
combined stacks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95541
Approved by: https://github.com/ezyang
`_use_sharded_grad_views()` can be called when re-registering the original parameters in `load_state_dict()`, in which case the training state is `IDLE`. Previously, I only expected `_use_sharded_grad_views()` to be called in `FORWARD` when the sharded gradient is not in `_saved_grad_shard` or `_cpu_grad`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96584
Approved by: https://github.com/fegin, https://github.com/zhaojuanmao
Summary:
The reference quantized LSTM implementation uses unbind and inplace squeeze both of which are not supported when building BoltNN's Espresso IR graph.
This change adjusts the reference AO quantizable LSTM implementation without affecting it numerically, while enabling removal of the unsupported ops in BoltNN.
Modifications & Adjustments
1. Unbind ops appear when unstacking a tensor in a loop. Replaced this by reading the first dim from the shape and looping with a ranged index.
2. Removed unbind op calls where the pattern `[x = t.unbind(0) -> x[i]]` can simply be replaced by `t[i]`, since creating a tuple from unbind is unnecessary (see the sketch below).
3. Uses of the in-place `squeeze_` that were not required have been replaced by `squeeze`.
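A minimal sketch of these rewrites (the tensor names are illustrative, not from the actual model):
```python
import torch

t = torch.randn(4, 3)

# Before: unbind creates a tuple of slices just to index one of them.
first_old = t.unbind(0)[0]
# After: direct indexing gives the same slice without the unbind op.
first_new = t[0]
assert torch.equal(first_old, first_new)

# Loop over the first dimension with a ranged index instead of unstacking.
for i in range(t.shape[0]):
    step_input = t[i]

# Out-of-place squeeze instead of the in-place squeeze_.
x = torch.randn(1, 3)
y = x.squeeze(0)  # replaces x.squeeze_(0)
```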
See notebook N3235193, which was used for testing the quantization flow and for inspecting the torch-scripted quantized model for the set of ops used (see the last cell).
Test Plan: N3235193
Reviewed By: andrewor14
Differential Revision: D43935389
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96436
Approved by: https://github.com/andrewor14
Adding the PR discussed in #96295.
- Adds tests for all current padding layers to `module_db` in `torch/testing/_internal/common_modules.py` ( `nn.ReflectionPad`, `nn.ReplicationPad`, `nn.ZeroPad`, `nn.ConstantPad` ) for 1D, 2D, and 3D variants.
- Removes tests for the same padding layers from `torch/testing/_internal/common_nn.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96641
Approved by: https://github.com/albanD
TODO (cc @soumith @voznesenskym @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @desertfire @ZainRizvi) hopefully I can convert the Rockset query I'm using to a public API and delete the Rockset API usage (and the need for an API key) from this before landing. If that's not easy, or if I need to make a new query first, maybe I should land this as-is so at least people can use it if they get an API key. Also, any bad practices in how I parsed/mangled the filenames? It would be nice to make the naming of artifacts more consistent with the job names so less mangling is needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96480
Approved by: https://github.com/ZainRizvi
Fix https://github.com/pytorch/pytorch/issues/96446
The root cause is that the logical comparison op works on the integer vector which is later used in the `where` op that expects a float vector.
1. Make sure float vec mask is applied on logical comparison ops.
2. Fix the vec int specialization for `to_float_mask`. It assumes an int mask as input and returns the float mask via a reinterpret cast.
3. Add a no-op specialization for `to_float_mask` function with the float vec as input.
4. Pass value instead of ref to `to_float_mask`. Passing by value should be efficient enough.
5. Remove a conditional check `!=0` in `masked()` since `to_float_mask` is guaranteed to return a float mask.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96502
Approved by: https://github.com/EikanWang, https://github.com/XiaobingSuper, https://github.com/jansel
Summary: Enable the functionality of delaying all-reduce in DDP by specifying the parameters whose all-reduce will be hooked to a specific param. This prevents AllReduce from blocking All2All in some recommendation models.
Test Plan: GitHub CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96673
Approved by: https://github.com/zhaojuanmao
Fixes part of #96414
Replaces calls to `sizes` with `sym_sizes`. Still seeing an error with the repro script:
``` Bash
Exception raised from sizes_default at /scratch/drisspg/work/pytorch/c10/core/TensorImpl.h:635 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x7d (0x7f697f4a141d in /scratch/drisspg/work/pytorch/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0xdd (0x7f697f49fbcd in /scratch/drisspg/work/pytorch/torch/lib/libc10.so)
frame #2: c10::TensorImpl::sizes_custom() const + 0x95 (0x7f697f4824c5 in /scratch/drisspg/work/pytorch/torch/lib/libc10.so)
frame #3: at::native::empty_like(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, c10::optional<c10::MemoryFormat>) + 0x92c (0x7f69809d18ac in /scratch/drisspg/work/pytorch/torch/lib/libtorch_cpu.so)
frame #4: <unknown function> + 0x23f5ce7 (0x7f698193bce7 in /scratch/drisspg/work/pytorch/torch/lib/libtorch_cpu.so)
```
Still trying to track down this empty call. From the looks of it, it might be coming from at::layer_norm? The backtrace from lldb is 221 frames, however, so there is a lot of noise.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96674
Approved by: https://github.com/ezyang
- add graph-breaks baselines
- add check_graph_breaks script (message users on regress or improvement)
- hook up test.sh for existing accuracy job
Refactor the graph-break CI check: take steps toward merging the checker with the existing check flow, and consider merging it all the way inside the bench runner.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96346
Approved by: https://github.com/ezyang
Fixes an internal linking problem after `DECLARE_DISPATCH` was introduced in SparseTensorUtils.cpp but implemented inside the native library.
Also fixes a `sign-unsigned` compare in `_flatten_indices_impl`.
Follow-ups:
Move code declared/implemented in `SparseTensorUtils.*` to the `at::native` namespace.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96696
Approved by: https://github.com/albanD
When we checkpoint the state of the private pool allocator, we will need to make sure that its currently live allocated blocks will get properly cleaned up when the tensors they correspond to die. Return DataPtrs for these newly allocated blocks that the callee can swap onto live Tensors.
The exact API for setting the checkpoint can be adjusted after this as the cudagraph implementation is built out, but this at least shows it is sufficiently general.
This should be the last PR touching cuda caching allocator necessary for new cudagraphs integration.
Differential Revision: [D43999888](https://our.internmc.facebook.com/intern/diff/D43999888)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95020
Approved by: https://github.com/zdevito
Copying note from cuda caching allocator:
```
* Note [Checkpointing PrivatePoolState]
*
* Refer above to Note [Interaction with CUDA graph capture]. Allocations made
* during graph capture are made from a separate private pool. During graph
* capture allocations behave as usual. During graph replay the allocator
* state does not change even as new tensors are created. The private pool
* will not free its blocks to the main caching allocator until cuda graph use
* is finished to prevent an allocation from eager clobbering the memory from
* a live but unaccounted for tensor that was created during replay.
*
* `make_graphed_callables`, a series of separate callables chained in
* successive cuda graphs, can share a memory pool because after a cuda graph
* recording the allocations in the shared private pool exactly reflect the
* tensors that are allocated.
*
* We would like to extend callable chaining to support a graphed callable
* tree. In this scenario, we have a tree of callable chains which will be
* captured with cuda graphs. In the diagram below, we have a tree with four
* callables, A, B, C, and D. Suppose we have captured, and subsequently
* replayed, A, B, and C. Then on a new invocation, we replay A and B, but
* would now like to record D. At this point the private pool will not reflect
* any of the live tensors created during graph replay. Allocations made
* during a new recording with the pool could overwrite those live tensors.
*
* In order to record a new graph capture after replaying prior callables in
* the tree, we need the allocator to reflect the state of the live tensors.
* We checkpoint the state of the private pool after each recording, and then
* reapply it when we are starting a new recording chain. Additionally, we
* must free the allocations for any tensors that died between the end of our
* previous graph replaying and our new recording (TODO). All of the allocated
* segments that existed in the checkpointed state must still exist in the
* pool. There may also exist new segments, which we will free (TODO : link
* note [live tensors between iterations] when it exists).
*
*
* ---------------> A ---------------> B ---------------> C
* |
* |
* |
* |
* ---------------> D
```
A few TODOs:
- need to add logic for freeing tensors that have died between a last replay and current new recording
- Add logic for free that might be called on a pointer multiple times (because we are manually freeing live tensors)
The two scenarios above have not been exercised in the tests yet.
Differential Revision: [D43999889](https://our.internmc.facebook.com/intern/diff/D43999889)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94653
Approved by: https://github.com/zdevito
The current dashboard issue is due to a .pt file in torchbench that has been modified for some reason. This clears any local changes before pulling.
Tested in a duplicate dashboard environment with the same .pt file modified:
* Before the change to this makefile, `make pull-deps` fails
* After the change to this makefile, `make pull-deps` succeeds.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96667
Approved by: https://github.com/anijain2305
Summary:
We don't want to depend on ATen ops when loading a model on Core ML, and `at::empty` is considered an op.
So replace it with from_blob.
Test Plan:
Run Core ML backend to ensure it works for existing use cases.
Also test running Core ML backend without any ops.
Differential Revision: D43961679
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96564
Approved by: https://github.com/f-meloni, https://github.com/kimishpatel
This PR enables our non-meta contributors to be able to approve
"functorch" PRs without intervention from meta contributors.
A PR is deemed a "functorch" PR if it matches one of the patterns in
merge_rules.yaml. These patterns are definitely not exhaustive
(we modify core pytorch pieces quite often), but should be a good starting
point.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96657
Approved by: https://github.com/albanD
Minor correction. `HingeEmbeddingLoss`'s documentation had the following piecewise function, but there is no $\Delta$ in the function definition; it was used to denote `margin`.
$$l_n = \begin{cases}
x_n, & \text{if}\; y_n = 1,\\
\max \{0, \Delta - x_n\}, & \text{if}\; y_n = -1,
\end{cases}$$
Following other documentation guidelines, `HuberLoss` has a parameter `delta`, and its piecewise function is defined as follows; using $delta$ as a reference to the `delta` parameter and not $\Delta$.
$$l_n = \begin{cases}
0.5 (x_n - y_n)^2, & \text{if } |x_n - y_n| < delta \\
delta * (|x_n - y_n| - 0.5 * delta), & \text{otherwise }
\end{cases}$$
So by analogy, `HingeEmbeddingLoss` should follow the same convention; thus, the correct piecewise function for it should instead be the following.
$$l_n = \begin{cases}
x_n, & \text{if}\; y_n = 1,\\
\max \{0, margin - x_n\}, & \text{if}\; y_n = -1,
\end{cases}$$
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95140
Approved by: https://github.com/albanD
Implements native mixed precision support for DDP in a similar fashion to how it is enabled for FSDP. The implementation works as follows:
1. In DDP init, we save `_mp_param` and `_fp_param` variables to manage mixed precision parameter usage. In particular, _mp_param will represent the parameter in the reduced precision, while _fp_param will represent the param in regular precision. During forward/backward, we swap back and forth as needed.
2. The root module gets a root pre-forward hook that kicks off copies to the reduced precision for all submodules. An event is recorded for each submodule to allow for waiting, as we run these asynchronously.
3. Each module gets a pre-forward hook that waits on its corresponding event. Note that modules might be reused during training; in this case the wait is only done for the first module. After this wait, the module's parameters are in reduced precision.
4. In the pre-forward hook, we register a backward hook on the lower precision parameters in order to run reduced precision allreduce + parameter upcast. We can't rely on the Reducer's constructor setting up these hooks because the gradient is accumulated on the low precision param, so we need to register them ourselves.
5. In the backward pass, when the hook runs, we first run allreduce + divide in the reduced precision. Next, we upcast parameters and gradients back to fp32 asynchronously. We also queue a callback at the end of backward to wait on these upcasts so that the upcast is complete before optim.step() runs.
6. Parameters that don't require grad are also cast since they may be used in computation; they are upcast back in the final autograd callback.
7. DDP Ignored parameters are not touched.
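A schematic sketch (not the actual DDP implementation) of the pre-forward downcast described in steps 2-4; the helper names here are illustrative, and the real flow additionally records and waits on events, registers the backward hooks, and upcasts back to fp32 after the reduced-precision allreduce:
```python
import torch
import torch.nn as nn

def _cast_module_params(module: nn.Module, dtype: torch.dtype) -> None:
    # Swap this module's own parameters to the reduced precision.
    for p in module.parameters(recurse=False):
        p.data = p.data.to(dtype)

def attach_low_precision_pre_forward(model: nn.Module, low_dtype=torch.float16):
    def pre_forward(module, inputs):
        _cast_module_params(module, low_dtype)
        # Cast floating-point inputs so the reduced-precision compute type-checks.
        return tuple(
            t.to(low_dtype) if torch.is_tensor(t) and t.is_floating_point() else t
            for t in inputs
        )

    for m in model.modules():
        m.register_forward_pre_hook(pre_forward)
    return model
```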
Follow-ups:
1. Unify comm hooks and make it work with apply optimizer in backward
2. implement keep_low_precision_grads,
3. allow BN, LN, or custom units to run in reduced precision,
4. support for cast_forward_inputs
5. Unify certain APIs / helpers with FSDP where possible, such as for _cast_forward_inputs
6. Integrate this with replicate() API.
7. The order in which we kick off copies and wait for them is set by the iteration order of module.modules(), but this might not be how the modules are used in the actual training. In the worst case, the last module in module.modules() could be used first which would result in waiting for all copies unnecessarily. For static graphs, we should record the module execution order and copy / wait in this order.
8. Entirely unused modules probably don't need to be cast.
Differential Revision: [D42515803](https://our.internmc.facebook.com/intern/diff/D42515803/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92882
Approved by: https://github.com/zhaojuanmao
__What?__
Per discussion at #94634, deprecate `masked_fill` with non-bool masks. Deprecation warnings were previously added by #22261, but not for Apple MPS. I can revert the MPS changes if deprecation warnings are wanted first, though. See also #96112.
Fixes #85063 and #89320.
__Further Development?__
- Fixed the mask dtype checking for the cuda dispatch for `masked_fill` in `aten/src/ATen/native/cuda/Indexing.cu`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96594
Approved by: https://github.com/malfet, https://github.com/ngimel
Adds the ability to quickly generate stack traces for C++,
and combine Python, TorchScript, and C++ frames into a single trace.
This makes it possible for the memory tracer to record allocations inside
C++ code (e.g. convolution temporaries, backward operators).
The unwinder code is ~10x faster than execinfo.h's backtrace because it caches fast unwinder routines for instruction pointers that have already been seen.
It is also only 1.2--2x slower than copying the entire stack (the approach perf takes),
while using 2 orders of magnitude less space per stack.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95357
Approved by: https://github.com/bertmaher
`inspect.getfullargspec` does not properly handle functions/methods wrapped by functools.wraps(). As a result, it gets an empty list of `args` in FullArgSpec.
This PR rewrites the logic using `inspect.signature`, which handles functools.wraps() correctly.
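A small illustration of the difference (the decorator and function here are made up for the example):
```python
import functools
import inspect

def decorate(fn):
    @functools.wraps(fn)
    def inner(*args, **kwargs):
        return fn(*args, **kwargs)
    return inner

@decorate
def f(x, y=1):
    return x + y

# getfullargspec ignores __wrapped__, so it only sees the (*args, **kwargs) wrapper.
print(inspect.getfullargspec(f).args)         # []
# signature() follows __wrapped__ and recovers the original parameters.
print(list(inspect.signature(f).parameters))  # ['x', 'y']
```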
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96557
Approved by: https://github.com/jansel
There are two assertions in `torch.jit.annotations.try_ann_to_type` that could benefit from adding source level location information.
For example, the current assertion:
```
msg = "Unsupported annotation {} could not be resolved because {} could not be resolved."
assert valid_type, msg.format(repr(ann), repr(contained))
```
reports:
```
AssertionError: Unsupported annotation typing.Union[typing.Dict, NoneType] could not be resolved because typing.Dict could not be resolved at
```
I find it beneficial to know from which line of code this assertion was triggered. Adding the location information then reports:
```
AssertionError: Unsupported annotation typing.Union[typing.Dict, NoneType] could not be resolved because typing.Dict could not be resolved at
File "/home/schuetze/Documents/work/github/prediction_net/multimodal/models/heads/retina_head.py", line 189
def forward(self, fpn_features: t.Dict[str, torch.Tensor],
inputs: t.Dict[str, torch.Tensor],
gts: t.Optional[t.Dict] = None) -> t.Dict[str, t.Any]:
~~~~~~~~~~~~~~~~~~ <--- HERE
"""
"""
```
Adding these location information are related to #96420 but these changes in this PR can be made without any API changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96423
Approved by: https://github.com/davidberard98
Summary: if trace.upload_tar is set, it's a function, and it can't be pickled.
Test Plan:
Used on a Meta-internal workload; also, hacked up
test/inductor/test_smoke.py to set trace.upload_tar and ran with
TORCH_COMPILE_DEBUG=1
Reviewed By: mlazos
Differential Revision: D43915178
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96519
Approved by: https://github.com/ngimel, https://github.com/jansel
In #95305 the _exchange_device ops are getting dead-code-eliminated, so they don't get called. #95306 fixes this by using the output of the op, but it's still possible that JIT might reorder the op around other ops.
This PR marks _exchange_device as having side effects so that the ops won't get dead code eliminated or reordered, even if the return is not used.
Differential Revision: [D43966285](https://our.internmc.facebook.com/intern/diff/D43966285)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96364
Approved by: https://github.com/eellison
I noticed from the Rockset data that there are only `float32` records, while there should be both dtypes there. It turns out that the benchmark script generated by `runner.py` always removes the output directory by default, so only the records from the `float32` run, which happens later, are left.
For example, `rm -rf /var/lib/jenkins/workspace/test/test-reports` appeared twice in the CI log https://ossci-raw-job-status.s3.amazonaws.com/log/11840774308.
I'm adding a new flag `--keep-output-dir` to keep the output directory. This is off by default, as I'm not sure how this script is used internally; people probably expect to see the output directory cleaned up every time.
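A hypothetical sketch of how such a flag could be wired up, assuming an argparse-based runner (the option name matches the PR, but the path and surrounding code are illustrative):
```python
import argparse
import shutil

parser = argparse.ArgumentParser()
parser.add_argument(
    "--keep-output-dir",
    action="store_true",
    help="Do not remove the output directory before running the benchmarks.",
)
args = parser.parse_args()

output_dir = "benchmark_logs"  # illustrative path, not the runner's real default
if not args.keep_output_dir:
    shutil.rmtree(output_dir, ignore_errors=True)
```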
### Testing
I don't really want to start the 10h jobs just to test this small flag, so I have triple-checked the change to make sure that there is no bug.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96398
Approved by: https://github.com/weiwangmeta
Summary: Rather than starting the timeline at t=0, keep the actual timestamps of the memory events.
Test Plan: CI Tests
Reviewed By: leitian, chaekit
Differential Revision: D43807624
Pulled By: aaronenyeshi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96535
Approved by: https://github.com/davidberard98
They are already present in trunk.yml. During the migration from 11.6->11.7 to 11.7->11.8, the 11.6 trunk jobs were migrated to 11.7, but the 11.7 periodic jobs were not migrated; the 11.8 jobs were simply added.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96552
Approved by: https://github.com/huydhn
Planning to do a full writeup later. The short story is, sometimes the following chain of events happens:
1. We turn on Dynamo's custom frame handler
2. GC triggers (and all of the finalizers run under Dynamo)
3. GC hits a GeneratorExit frame
4. You end up in the custom frame handler with throw_flag == TRUE and PyErr_Occurred() != NULL
If this happens and we blindly call into other Python functions (like the Python callback), the executed Python code will immediately raise an exception (because there's already an ambient exception set.) This is very, very confusing. The fix is to defer to the regular handler when throw_flag is TRUE.
I triggered this locally with
```
PYTHONUNBUFFERED=1 pytest test/dynamo/test_dynamic_shapes.py -k 'Unspec and export and not dupes and not reorder' -v -x -s
```
But I also have some tests which trigger the problem synthetically.
Fixes https://github.com/pytorch/pytorch/issues/93781
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96488
Approved by: https://github.com/albanD
@ezyang This is a minor change.
I was using the doctests to check that my install wasn't broken via:
```bash
xdoctest -m torch --style=google --global-exec "from torch import nn\nimport torch.nn.functional as F\nimport torch" --options="+IGNORE_WHITESPACE"
```
And noticed that it stops in the middle to show this matplotlib figure. I added a condition so it only does the pyplot show if a DOCTEST_SHOW environment variable exists. With this fix the above command runs to completion and is an easy way for users to put torch through its paces given just a fresh install.
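The gist of the change is a small environment-variable gate around the blocking call, roughly:
```python
import os
import matplotlib.pyplot as plt

# ... the doctest builds the example figure here ...
if os.environ.get("DOCTEST_SHOW"):
    # Only pop up an interactive window when explicitly requested.
    plt.show()
```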
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96522
Approved by: https://github.com/ezyang
Introduce `getMPSScalarType(const Tensor&)` that calls `getMPSScalarType(t.scalar_type())`, and replace `getMPSScalarType(t.scalar_type())` with `getMPSScalarType(t)` throughout the codebase.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96521
Approved by: https://github.com/seemethere
Enable pytest for a few unique files. pytest runs tests in a different order than unittest (but still a consistent ordering with respect to itself) and some tests change global state, causing other tests to fail.
`test_transpose_non_contiguous` in `test_torchinductor.py` gets impacted by some other test, but I'm not sure which one, so my solution is to reset the metrics before the rest of the test runs.
`test_register_patterns` in `test_quantize_fx.py` adds extra keys to global variables, so remove them when the test is done via unittest's `addCleanup`, which also works under pytest (see the sketch below).
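A minimal sketch of the `addCleanup` pattern (the registry and test names here are illustrative):
```python
import unittest

GLOBAL_PATTERNS = {}  # stand-in for the real module-level registry

class TestRegisterPatterns(unittest.TestCase):
    def test_register_patterns(self):
        GLOBAL_PATTERNS["my_key"] = object()
        # Runs after the test under both unittest and pytest, restoring global state.
        self.addCleanup(GLOBAL_PATTERNS.pop, "my_key", None)
        self.assertIn("my_key", GLOBAL_PATTERNS)
```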
pytest doesn't really have an equivalent for `load_tests` so change it to be like `test_jit` that imports all the classes. I also attempted to dynamically import them, but I failed.
`test_public_api_surface` in `test_fx.py` checks for a backwards compatibility classification. There is a different test in test_fx that results in `fuser_utils` being imported. pytest runs this test before `test_public_api_surface` while unittest runs it after, so pytest sees `fuser_utils` when crawling through the modules.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96397
Approved by: https://github.com/huydhn
Summary: Added the functionality to export the memory timeline plot as a list of times and sizes, which the post processing visualization can parse and plot.
Test Plan: CI Tests
Reviewed By: leitian, fengxizhou
Differential Revision: D43680760
Pulled By: aaronenyeshi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96137
Approved by: https://github.com/chaekit
Follow-up to #96245. alexnet, Background_Matting, vision_maskrcnn, and vgg16 all have the same problem; but on float32 they were also failing on the previous day so I missed this. Once the amp jobs became available I could see that these have the same issue (on both float32 and amp).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96324
Approved by: https://github.com/desertfire
Remove all references to land checks (rebase on viable/strict in a different branch) since they are no longer used. Adding ciflow/trunk on merge and/or rebasing the entire PR is preferred.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96401
Approved by: https://github.com/huydhn
Summary: Manually adding dependencies between _foreach_add_, _fused_adam_, and the output can cause issues when lowering to Inductor. This API removes those dependencies.
Test Plan: CI
Differential Revision: D43916450
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96323
Approved by: https://github.com/kumpera
This patch aims to add support for an XPU profiler that will work together with Kineto. After this PR, Kineto will follow these APIs to integrate itself. The development of the Python interface is also nearly done.
Signed-off-by: Huang, Xunsong <xunsong.huang@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94502
Approved by: https://github.com/ezyang
Part of #91395
Also modifies how `StorageImpl`s are stored in the JIT static runtime's `MemoryPlanner`, which used to `std::move` `StorageImpl`s into a vector. But `StorageImpl` can no longer be moved. Instead, `MemoryPlanner` now contains a malloc'ed buffer to which we add new `StorageImpl`s using placement new.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93342
Approved by: https://github.com/ezyang
Summary:
My team has been hitting a mysterious crash for a few months on a windows binary that uses Caffe2 inside a worker thread.
When this thread gets destroyed, there is an error at this line in context_gpu.h where the state of this operation gives CUDNN_STATUS_INTERNAL_ERROR instead of CUDNN_STATUS_SUCCESS.
When enabling cudnn debug logs (via the env variables nvidia specifies), I can see that the context is destroyed twice, even though this code only destroys it once, so something mysterious is causing a double free.
This seems very very similar to the issue/fix described here for pytorch:
https://github.com/pytorch/pytorch/issues/17658
https://github.com/apache/tvm/pull/8267
And pytorch handles this in the same way, by just not calling cudnnDestroy
This seems to have become an issue with cuda11, but I tested cuda12 as well and found that the issue persists so this needs to be somehow fixed.
Test Plan:
CI
I checked that the specific Windows binary I am using is able to create and destroy caffe2-invoking threads without causing the application to crash.
buck run arvr/mode/win/cuda11/opt //arvr/projects/nimble/prod/tools/MonoHandTrackingVis
Differential Revision: D43538017
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95382
Approved by: https://github.com/malfet
`inspect.stack()` retrieves frame info for the entire stack and is not performant. `inspect.stack(0)` speeds up the call greatly, but loses the line snippet.
Rewrite with `traceback.extract_stack`, which is better in both regards.
Speeds up `export` call in `test_gpt2_tiny` from ~30s to ~4s under profiling.
Before
```log
│...├─ 30.794 export_after_normalizing_args_and_kwargs <@beartype(torch.onnx._internal.fx.exporter.export_after_normalizing_args_and_kwargs) at 0x7f815cba0700>:1
│...│ └─ 30.794 export_after_normalizing_args_and_kwargs torch/onnx/_internal/fx/exporter.py:580
```
After
```log
│...├─ 4.427 export_after_normalizing_args_and_kwargs <@beartype(torch.onnx._internal.fx.exporter.export_after_normalizing_args_and_kwargs) at 0x7fd8281b3700>:1
│...│ └─ 4.427 export_after_normalizing_args_and_kwargs torch/onnx/_internal/fx/exporter.py:580
```
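A small sketch of the replacement pattern (the helper name is illustrative):
```python
import traceback

def current_frames(limit: int = 3):
    # extract_stack keeps filename/lineno/name/line for each frame without the
    # heavier per-frame introspection that makes inspect.stack() slow.
    frames = traceback.extract_stack()
    return [f"{fr.filename}:{fr.lineno} in {fr.name}" for fr in frames[-limit:]]

print(current_frames())
```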
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96348
Approved by: https://github.com/titaiwangms, https://github.com/justinchuby
Summary: When parameters are flattened, multiple parameters share the same step. When unflattening the parameters, the current implementation still makes these parameters share the same step. While this is not wrong, some training infra gets confused by the shared tensor storage. This PR fixes the issue.
Test Plan: CI
Reviewed By: awgu
Differential Revision: D43893592
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96313
Approved by: https://github.com/zhaojuanmao
Summary:
Makes the `nnqr.Linear` module respect the qmin/qmax attributes of weight observer. This is to unblock some customer teams who are depending on non-default values of these attributes.
Test plan:
```
python test/test_quantization.py -k TestReferenceQuantizedModule.test_linear_decomposed
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96232
Approved by: https://github.com/andrewor14
Fixes #95796
### Implementation
Adds python implementation for `nn.ZeroPad1d` and `nn.ZeroPad3d` in `torch/nn/modules/padding.py`.
Adds cpp implementation for `nn::ZeroPad1d` and `nn::ZeroPad3d` in the following 3 files, refactored with templates similarly to `nn::ConstantPad`'s implementation:
- `torch/csrc/api/include/torch/nn/modules/padding.h`
- `torch/csrc/api/include/torch/nn/options/padding.h`
- `torch/csrc/api/src/nn/modules/padding.cpp`
Also added relevant definitions in `torch/nn/modules/__init__.py`.
### Testing
Adds the following tests:
- cpp tests of similar length and structure as `ConstantPad` and the existing `ZeroPad2d` impl in `test/cpp/api/modules.cpp`
- cpp API parity tests in `torch/testing/_internal/common_nn.py`
- module init tests in `test/test_module_init.py`
Also added relevant definitions in `test/cpp_api_parity/parity-tracker.md`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96295
Approved by: https://github.com/soulitzer
This PR introduces some modifications:
1. We find some const function parameters that can be passed by reference and add the reference.
2. We find more opportunities for passing by value and change them accordingly.
3. Some use-after-move errors are fixed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95942
Approved by: https://github.com/Skylion007
I added two constants. The first helps avoid rounding once we hit a certain threshold, and the second controls which blocks can be cached.
Allocations larger than `kMaxRoundThreshold` will not be rounded to the next power of two anymore. Generally it is expected that larger allocations happen less frequently, and this more or less matches what happens in `CudaCachingAllocator`.
Blocks larger than `kMaxCachedSize` will not be cached. This is a separate problem from the above, but I noticed the caching here is poorly implemented and does nothing to avoid fragmentation or to help with good resource utilization. For example, consider the following allocations:
```
t1 = alloc(4GB)
del t1
t2 = alloc(10k)
t3 = alloc(4GB)
```
this results in allocating 8GB, because the first 4GB block that is cached gets assigned to the 10k allocation, wasting the rest of the block.
Lastly, ideally I would make these constants configurable, but looking around the code I didn't see any existing mechanism in ATen to configure things at runtime.
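A Python sketch of the policy described above (the constant values here are illustrative, not the ones chosen in the PR):
```python
K_MAX_ROUND_THRESHOLD = 2 * 1024 * 1024  # illustrative value
K_MAX_CACHED_SIZE = 64 * 1024 * 1024     # illustrative value

def rounded_alloc_size(nbytes: int) -> int:
    # Requests up to the threshold are rounded to the next power of two;
    # larger requests are allocated exactly to avoid wasting memory.
    if nbytes > K_MAX_ROUND_THRESHOLD:
        return nbytes
    power = 1
    while power < nbytes:
        power *= 2
    return power

def should_cache_block(block_size: int) -> bool:
    # Huge freed blocks are released instead of cached, so a 4GB block cannot be
    # pinned by a later 10k allocation as in the example above.
    return block_size <= K_MAX_CACHED_SIZE
```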
Fixes #95823
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95827
Approved by: https://github.com/ngimel
The native implementation of LSTM has been fixed on macOS 13.
On macOS 12, the multi-layer LSTM still has a numerical correctness issue that cannot be resolved on OS's side.
Thus, we fall back the multi-layer LSTM on macOS 12 to LSTMCell iteration. It might have performance impact but will make LSTM on macOS 12 fully usable.
Fixes: #90421
Issues related: #80306, #83144
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90909
Approved by: https://github.com/albanD, https://github.com/kulinseth
This PR addresses issue [#81075](https://github.com/pytorch/pytorch/issues/81075), making `torch.stft` compatible with ONNX Opset 17's STFT operator.
The conversion works for _most_ of `torch.stft` functionality:
- Batched or unbatched inputs
- Normalization
- Pre-computed windows
- Rectangular windows
- One-sided returns
- Window centering (implicitly supported)
What is currently _not_ supported is **complex types**, due to the lack of conversion functionality between PyTorch and ONNX (https://github.com/pytorch/pytorch/issues/86746).
Regardless, this is easy to bypass by setting `return_complex=False` when using `torch.stft`.
Note that there is already a draft PR to address this (https://github.com/pytorch/pytorch/pull/83944), but it is currently closed and it only partially addresses the conversion (i.e., most of `torch.stft` functionality is lacking, and unit tests are missing).
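A minimal usage sketch under these constraints (the file name and sizes are arbitrary):
```python
import torch

class Spectrogram(torch.nn.Module):
    def forward(self, waveform):
        window = torch.hann_window(400)
        # return_complex=False keeps the output real-valued, since complex types
        # cannot currently be converted to ONNX.
        return torch.stft(waveform, n_fft=400, hop_length=160,
                          window=window, return_complex=False)

waveform = torch.randn(2, 16000)
torch.onnx.export(Spectrogram(), (waveform,), "stft.onnx", opset_version=17)
```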
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92087
Approved by: https://github.com/justinchuby
Set environment variable
```
PYTORCH_TEST_DO_NOT_USE_PYTEST=1
```
to not use pytest in pytorch unit testing.
This change is related to some recent changes, e.g. #96210, #96016, #95844, #95659, that enabled the use of pytest in many test modules. Those test modules passed normally before, but fail immediately once pytest is used. A sample stack trace is:
```python
root@8e3168a83ee2:/opt/pytorch/pytorch# python test/run_test.py -v -i test_optim -- -v --save-xml
Ignoring disabled issues: []
/opt/pytorch/pytorch/test/run_test.py:1225: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
if torch.version.cuda is not None and LooseVersion(torch.version.cuda) >= "11.6":
Selected tests:
test_optim
parallel (file granularity) tests:
test_optim
serial (file granularity) tests:
Ignoring disabled issues: []
Ignoring disabled issues: []
Running test_optim ... [2023-03-09 12:51:59.358110]
Executing ['/usr/local/bin/python', '-bb', 'test_optim.py', '-v', '--save-xml', '-v', '--use-pytest', '-vv', '-rfEX', '-x', '--reruns=2'] ... [2023-03-09 12:51:59.358810]
Test results will be stored in test-reports/python-pytest/test_optim/test_optim-5e41643c8bac8ace.xml
Traceback (most recent call last):
File "/opt/pytorch/pytorch/test/test_optim.py", line 4581, in <module>
run_tests()
File "/opt/pytorch/pytorch/torch/testing/_internal/common_utils.py", line 796, in run_tests
exit_code = pytest.main(args=pytest_args)
File "/usr/local/lib/python3.10/site-packages/_pytest/config/__init__.py", line 148, in main
config = _prepareconfig(args, plugins)
File "/usr/local/lib/python3.10/site-packages/_pytest/config/__init__.py", line 329, in _prepareconfig
config = pluginmanager.hook.pytest_cmdline_parse(
File "/usr/local/lib/python3.10/site-packages/pluggy/_hooks.py", line 265, in __call__
return self._hookexec(self.name, self.get_hookimpls(), kwargs, firstresult)
File "/usr/local/lib/python3.10/site-packages/pluggy/_manager.py", line 80, in _hookexec
return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
File "/usr/local/lib/python3.10/site-packages/pluggy/_callers.py", line 55, in _multicall
gen.send(outcome)
File "/usr/local/lib/python3.10/site-packages/_pytest/helpconfig.py", line 103, in pytest_cmdline_parse
config: Config = outcome.get_result()
File "/usr/local/lib/python3.10/site-packages/pluggy/_result.py", line 60, in get_result
raise ex[1].with_traceback(ex[2])
File "/usr/local/lib/python3.10/site-packages/pluggy/_callers.py", line 39, in _multicall
res = hook_impl.function(*args)
File "/usr/local/lib/python3.10/site-packages/_pytest/config/__init__.py", line 1060, in pytest_cmdline_parse
self.parse(args)
File "/usr/local/lib/python3.10/site-packages/_pytest/config/__init__.py", line 1348, in parse
self._preparse(args, addopts=addopts)
File "/usr/local/lib/python3.10/site-packages/_pytest/config/__init__.py", line 1231, in _preparse
self.pluginmanager.load_setuptools_entrypoints("pytest11")
File "/usr/local/lib/python3.10/site-packages/pluggy/_manager.py", line 287, in load_setuptools_entrypoints
plugin = ep.load()
File "/usr/local/lib/python3.10/importlib/metadata/__init__.py", line 171, in load
module = import_module(match.group('module'))
File "/usr/local/lib/python3.10/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
File "/usr/local/lib/python3.10/site-packages/_pytest/assertion/rewrite.py", line 168, in exec_module
exec(co, module.__dict__)
File "/usr/local/lib/python3.10/site-packages/xdist/looponfail.py", line 16, in <module>
import execnet
File "/usr/local/lib/python3.10/site-packages/execnet/__init__.py", line 14, in <module>
from .gateway_base import DataFormatError
File "/usr/local/lib/python3.10/site-packages/execnet/gateway_base.py", line 1138, in <module>
FLOAT_FORMAT_SIZE = struct.calcsize(FLOAT_FORMAT)
BytesWarning: Comparison between bytes and string
FINISHED PRINTING LOG FILE of test_optim (/opt/pytorch/pytorch/test/test-reports/test_optim_1pnlesrz.log)
test_optim failed!
Traceback (most recent call last):
File "/opt/pytorch/pytorch/test/run_test.py", line 1428, in <module>
main()
File "/opt/pytorch/pytorch/test/run_test.py", line 1386, in main
raise RuntimeError(
RuntimeError: test_optim failed!
Tip: You can keep running tests even on failure by passing --keep-going to run_test.py.
If running on CI, add the 'keep-going' label to your PR and rerun your jobs.
```
I'd like to propose this option, which allows users to run their tests in CI with the good old Python unittest runner instead of pytest.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96444
Approved by: https://github.com/malfet
Reverting due to concerns over silent unsoundness (skipped hooks) if users have directly added hooks dicts without using official torch APIs.
This reverts commit 26045336ca323fd27cff2a7340fe896117d5fb6e.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96242
Approved by: https://github.com/albanD
This commit fixes a bug where the ONNX exporter for circular padding queried the input tensor shape in order to get the correct 'end' index for a slice node. This doesn't work when the axis in question has a dynamic size. The commit fixes this by setting the 'end' index to INT_MAX, which is the recommended way of slicing to the end of a dimension with unknown size per the ONNX spec.
See https://onnx.ai/onnx/operators/onnx__Slice.html
Also adds a regression test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95647
Approved by: https://github.com/BowenBao
Summary: The previous LSTM reference module implementation did
not handle dtypes other than quint8 correctly. This is because
the internal LSTM custom module quantization used eager mode,
which did not insert the q-dq ops properly. E.g., we want the
following reference quantized model:
```
[dq -> linear1_fp32 -> q_to_qint32] -> dq -> q_to_quint8 ->
[dq - linear2_fp32 -> q_to_quint8] -> dq -> ...
```
This requires two sets of `q - dq` pairs between two adjacent
ops that have different dtypes (linear1 and linear2). However,
these `q - dq` pairs were not inserted in the old flow, because
eager mode required users to insert Quant/DeQuantStubs manually.
This commit changes the internal LSTM custom module quantization
to use FX graph mode quantization, which automatically inserts
the `q - dq` ops that convert the dtypes between adjacent ops
correctly. However, using FX graph mode quantization here comes
with its own set of challenges that required some hacks to get
the end-to-end flow to work. These hacks are detailed in the
comments in the util functions.
Test Plan:
python test/test_quantization.py TestQuantizeFx.test_static_lstm_with_custom_fixed_qparams
This commit also updates the corresponding test to verify the
dtypes as well as the qparams in the reference quantized graph.
This test case should serve as an example for users to set up
their own LSTM reference module flows.
Reviewers: vkuzo, supriyar, jcaip
Subscribers: vkuzo, supriyar, jcaip
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96343
Approved by: https://github.com/vkuzo
Fixes#96064
When deciding whether to fuse nodes, we match indexing like `c0 + 5 * tmp0`, but `tmp0` in the different nodes can refer to totally different values. Even when `tmp0` is the same (like in the added test), inductor still generates wrongly ordered loads and stores (loads come before stores), so it is better to just disable this fusion altogether. We should also fix the wrong ordering:
```
@pointwise(size_hints=[8], filename=__file__, meta={'signature': {0: '*i64', 1: '*fp32', 2: '*fp32', 3: '*fp32', 4: 'i32'}, 'device': 0, 'constants': {}, 'mutated_arg_names': ['out_ptr0'], 'configs': [instance_descriptor(divisible_by_16=(0, 1, 2, 3), equal_to_1=())]})
@triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, out_ptr1, xnumel, XBLOCK : tl.constexpr):
xnumel = 5
xoffset = tl.program_id(0) * XBLOCK
xindex = xoffset + tl.arange(0, XBLOCK)[:]
xmask = xindex < xnumel
x0 = xindex
tmp0_load = tl.load(in_ptr0 + (0))
tmp0 = tl.broadcast_to(tmp0_load, [XBLOCK])
tmp1 = tl.load(in_ptr1 + (x0), xmask)
tmp2 = tl.load(out_ptr0 + (x0 + (5*tmp0)), xmask)
tl.store(out_ptr0 + (x0 + (5*tmp0) + tl.zeros([XBLOCK], tl.int32)), tmp1, xmask)
tl.store(out_ptr1 + (x0 + tl.zeros([XBLOCK], tl.int32)), tmp2, xmask)
```
Note: we are loading from `out_ptr0` here (that shouldn't happen), we are loading from it before storing to it.
After this PR, the kernel above is split in 2.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96273
Approved by: https://github.com/jansel
Summary:
Currently, selection along a dimension is only supported for 3D/rank-3 tensors in PyTorch Vulkan. This adds support for 4D/rank-4 tensors for selection along batch, channel (depth), height, and width.
Additionally:
- The existing implementations have been name-refactored to reflect whether they operate on 3d or 4d tensors.
- The params buffer for all select operations now use `ivec2` or `ivec4` only, for memory alignment safety.
Test Plan:
1. `buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1` on Apple M1 MacBook
2. Confirm all tests pass with no regression, and that the directly affected tests (`select_4d_*` and the refactored `select_3d_*`) pass
3. Test output P636928908, in particular:
```
[...bunch of other tests...]
[ RUN ] VulkanAPITest.select_3d_depth_small
[ OK ] VulkanAPITest.select_3d_depth_small (1 ms)
[ RUN ] VulkanAPITest.select_3d_depth_medium
[ OK ] VulkanAPITest.select_3d_depth_medium (0 ms)
[ RUN ] VulkanAPITest.select_3d_depth_large
[ OK ] VulkanAPITest.select_3d_depth_large (1 ms)
[ RUN ] VulkanAPITest.select_3d_height_small
[ OK ] VulkanAPITest.select_3d_height_small (0 ms)
[ RUN ] VulkanAPITest.select_3d_height_medium
[ OK ] VulkanAPITest.select_3d_height_medium (0 ms)
[ RUN ] VulkanAPITest.select_3d_height_medium1
[ OK ] VulkanAPITest.select_3d_height_medium1 (0 ms)
[ RUN ] VulkanAPITest.select_3d_height_medium2
[ OK ] VulkanAPITest.select_3d_height_medium2 (0 ms)
[ RUN ] VulkanAPITest.select_3d_height_large
[ OK ] VulkanAPITest.select_3d_height_large (1 ms)
[ RUN ] VulkanAPITest.select_3d_width_small
[ OK ] VulkanAPITest.select_3d_width_small (0 ms)
[ RUN ] VulkanAPITest.select_3d_width_medium
[ OK ] VulkanAPITest.select_3d_width_medium (0 ms)
[ RUN ] VulkanAPITest.select_3d_width_medium2
[ OK ] VulkanAPITest.select_3d_width_medium2 (0 ms)
[ RUN ] VulkanAPITest.select_3d_width_large
[ OK ] VulkanAPITest.select_3d_width_large (1 ms)
[ RUN ] VulkanAPITest.select_4d_batch_small
[ OK ] VulkanAPITest.select_4d_batch_small (0 ms)
[ RUN ] VulkanAPITest.select_4d_batch_medium
[ OK ] VulkanAPITest.select_4d_batch_medium (0 ms)
[ RUN ] VulkanAPITest.select_4d_batch_large
[ OK ] VulkanAPITest.select_4d_batch_large (1 ms)
[ RUN ] VulkanAPITest.select_4d_depth_small
[ OK ] VulkanAPITest.select_4d_depth_small (1 ms)
[ RUN ] VulkanAPITest.select_4d_depth_medium
[ OK ] VulkanAPITest.select_4d_depth_medium (0 ms)
[ RUN ] VulkanAPITest.select_4d_depth_large
[ OK ] VulkanAPITest.select_4d_depth_large (1 ms)
[ RUN ] VulkanAPITest.select_4d_height_small
[ OK ] VulkanAPITest.select_4d_height_small (0 ms)
[ RUN ] VulkanAPITest.select_4d_height_medium
[ OK ] VulkanAPITest.select_4d_height_medium (0 ms)
[ RUN ] VulkanAPITest.select_4d_height_large
[ OK ] VulkanAPITest.select_4d_height_large (1 ms)
[ RUN ] VulkanAPITest.select_4d_width_small
[ OK ] VulkanAPITest.select_4d_width_small (0 ms)
[ RUN ] VulkanAPITest.select_4d_width_medium
[ OK ] VulkanAPITest.select_4d_width_medium (0 ms)
[ RUN ] VulkanAPITest.select_4d_width_large
[ OK ] VulkanAPITest.select_4d_width_large (1 ms)
[...bunch of other tests...]
[ FAILED ] 7 tests, listed below:
[ FAILED ] VulkanAPITest.cat_dim1_singledepth_success
[ FAILED ] VulkanAPITest.gru_success
[ FAILED ] VulkanAPITest.gru_mclareninputs_success
[ FAILED ] VulkanAPITest.gru_prepack_success
[ FAILED ] VulkanAPITest.lstm_success
[ FAILED ] VulkanAPITest.lstm_mclareninputs_success
[ FAILED ] VulkanAPITest.lstm_prepack_success
```
Reviewed By: SS-JIA
Differential Revision: D42623181
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96228
Approved by: https://github.com/SS-JIA
This adds the option to use an unsafe `setattr` for `_use_sharded_views()` and `_use_unsharded_views()` gated by the environment variable `FSDP_USE_UNSAFE_SETATTR`, where a value of `1` means to use the unsafe `setattr`. The unsafe option is disabled by default.
The unsafe `setattr` may be able to save CPU overhead and may be used to intentionally bypass `setattr` checks. Both `_use_sharded_views()` and `_use_unsharded_views()` must use the unsafe version or use the safe versions atomically.
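Roughly, the idea behind the unsafe path looks like the following sketch (this is not FSDP's actual code; the helper names are illustrative):
```python
import torch
import torch.nn as nn

def unsafe_set_param(module: nn.Module, name: str, param: nn.Parameter) -> None:
    # Write directly into the parameter dict, skipping nn.Module.__setattr__ checks.
    module._parameters[name] = param

def safe_set_param(module: nn.Module, name: str, param: nn.Parameter) -> None:
    # Goes through the usual __setattr__ validation.
    setattr(module, name, param)

linear = nn.Linear(4, 4)
unsafe_set_param(linear, "weight", nn.Parameter(torch.zeros(4, 4)))
```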
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96326
Approved by: https://github.com/zhaojuanmao, https://github.com/fegin
Unclear if there is a more efficient way to define the allowed types for IR (or if we even need this; perhaps we just ditch the assert?), but Inductor experts can determine whether these added ops are appropriate and, if so, they fix the reported issue.
Fixes#96204
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96221
Approved by: https://github.com/ezyang
test-infra's linux_job uses github.ref as the default value for the ref, which is the branch, so it checks out the most recent commit on the branch.
Might be better to fix this on the test-infra side instead
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96317
Approved by: https://github.com/huydhn
expecttest is not imported into the OSS BUCK build yet. Using it in the target test_torchgen_executorch breaks the build.
Remove it first to fix the build. Will import and fix in a follow-up PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96314
Approved by: https://github.com/huydhn
Summary: ciflow/inductor-perf-test-nightly now contains a full dashboard
run, which takes a very long time. Ed proposed a simplification of the
perf run there, but it is still worthwhile to have a set of fast perf tests
that only includes one configuration (--training --amp).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96166
Approved by: https://github.com/huydhn, https://github.com/weiwangmeta
# Summary
This PR adds an optional kwarg to torch.nn.functional.scaled_dot_product_attention().
The new kwarg is a scaling factor that is applied after the q@k.T step of the computation. The efficient kernel was updated to support it, and flash and math were minimally updated to support it as well.
This will reduce the complexity of #94729 and has been asked for by a couple of users.
# Review Highlights
- As far as I know I did this the correct way, and it is both BC and FC compliant. However, I always seem to break internal workloads, so I would love it if someone could advise whether I did this right.
- I named the optional arg 'scale'. This is probably dumb and I should name it 'scale_factor'. I will make this change, but it is annoying and will require someone deciding that we should rename it.
- 'scale' is interpreted as `Q@K.T * (scale)`
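A minimal usage sketch of the new kwarg (shapes are arbitrary); by default the factor is 1/sqrt(head_dim), so passing that value explicitly should match the default output:
```python
import math
import torch
import torch.nn.functional as F

q = torch.randn(2, 8, 128, 64)
k = torch.randn(2, 8, 128, 64)
v = torch.randn(2, 8, 128, 64)

# scale= overrides the factor applied after the q @ k.transpose(-2, -1) step.
out_default = F.scaled_dot_product_attention(q, k, v)
out_scaled = F.scaled_dot_product_attention(q, k, v, scale=1.0 / math.sqrt(64))
print(torch.allclose(out_default, out_scaled, atol=1e-5))
```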
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95259
Approved by: https://github.com/cpuhrsch
Summary: This commit adds a test for mixing multiple dtypes
for different layers in the same model. The test verifies that
FX graph mode quantization converts the dtypes correctly
between the layers.
Test Plan:
python test/test_quantization.py TestQuantizeFx.test_mixed_dtypes
Reviewers: jcaip, vkuzo, supriyar
Subscribers: jcaip, vkuzo, supriyar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96104
Approved by: https://github.com/jcaip
This makes the next PR in the stack cleaner: having the top level entry point to aot autograd perform the functionalization analysis pass once, and plumb the metadata everywhere else that we need it.
I put it in a separate PR because I recently learned that this function is used in fbcode, so I'll need to fix up internals when I land this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95991
Approved by: https://github.com/ezyang
Fixes https://github.com/pytorch/pytorch/issues/95167
More details are in that issue. To summarize, the issue shows up when we have some code like this:
```
def f(x):
x.detach().mul_(2) # can also happen if the mul_() happens under torch.no_grad()
return x + 1
```
AOTAutograd will then spit out code like this:
```
def compiled_fn(x):
x_updated = x.mul(2)
out = x_updated + 1
return x_updated, out
def CompiledFunction.forward(x): # pseudocode, this is part of an autograd.Function
x_updated, out = compiled_function(x):
return x_updated, out
def runtime_wrapper(x):
x_updated, out = CompiledFunction.apply(x)
x.copy_(x_updated)
x = torch.ones(2, requires_grad=True)
out = runtime_wrapper(x)
```
However, the call to `x.copy_(x_updated)` will fail with the error: `a leaf Variable that requires grad is being used in an in-place operation`. This is because `x` is an autograd leaf, and autograd doesn't allow you to mutate leaves.
In this case though, the data mutation should be entirely opaque to autograd - all mutations happened underneath a `.detach()` or a `torch.no_grad()`.
As Ed pointed out in the issue, we can detect this situation by checking if the mutated input is an autograd leaf. If it is, then it must have been the case that any mutations on it must have been hidden from autograd, since otherwise the eager code would have error'd. The solution I added is to detect this situation, and manually run `x.detach().copy_(x_updated)`, to hide the update from autograd.
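A small standalone repro of the leaf behavior described above:
```python
import torch

x = torch.ones(2, requires_grad=True)
x_updated = torch.full((2,), 2.0)

try:
    x.copy_(x_updated)          # autograd rejects in-place ops on a leaf requiring grad
except RuntimeError as e:
    print(e)

x.detach().copy_(x_updated)     # hides the update from autograd and succeeds
print(x)                        # tensor([2., 2.], requires_grad=True)
```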
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95980
Approved by: https://github.com/ezyang
Previously, if dynamic shapes were turned on and we had a forward graph that returns a symint, then we would generate a backward graph that takes in a tangent input for that symint fwd output. This causes problems for downstream - inductor will see an input that it expects to be a symint, but it gets a `None` from autograd.
Confirmed that this repro now passes:
```
benchmarks/dynamo/torchbench.py --devices cuda --inductor --dynamic-shapes --unspecialize-int --accuracy --training --only drq
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96219
Approved by: https://github.com/ezyang
Summary: The original code uses a class variable to store the flat_parameter result. This could cause memory leaks.
Test Plan: CI and a E2E run
Reviewed By: awgu
Differential Revision: D43893577
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96263
Approved by: https://github.com/zhaojuanmao
Summary:
## Summary
torch.nn.functional.pixel_unshuffle and torch.narrow accept both float
and quantized inputs. However, previously we would unnecessarily
dequantize quantized inputs into floats before passing them to
the function. This commit fixes this by lowering the patterns
[dequant - pixel_unshuffle - quant] and
[dequant - narrow - quant].
Test Plan:
```
python test/test_quantization.py TestQuantizeFxOps.test_pixel_unshuffle
```
```
python test/test_quantization.py TestQuantizeFxOps.test_narrow
```
Differential Revision: D43858199
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96160
Approved by: https://github.com/andrewor14
sccache added GH cache as a storage option, so try to use it for the GH provided mac runners.
My experiments with this are varied. I tried a couple of different releases and the first run with a cold cache took 1hr (v0.3.3), 1hr (v0.4.0 pre7), 2hr (v0.3.3).
Afterwards it usually takes 30 minutes but sometimes longer, but no longer than 1hr.
I am using v0.4.0 pre7 because they reduced the amount of configuration/env vars you need to set and the GH cache keys get managed by sccache.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96142
Approved by: https://github.com/huydhn, https://github.com/malfet
Summary:
In ATen mode, we add the RuntimeContext arg, so we have something like
```
TORCH_API inline at::Tensor & gelu_outf(torch::executor::RuntimeContext & context, const at::Tensor & self, c10::string_view approximate, at::Tensor & out) {
return at::gelu_outf(self, approximate, out);
}
```
and user can use `<namespace like aten>::gelu_outf` and we will automatically dispatch the registered function in aten kernel using `at::gelu_outf` (dispatched by ATen/Functions.h header)
In the optimized kernel tests, we can now automatically switch between the ATen kernel and the optimized kernel.
The implication is that the test must depend on the correctness of codegen; an error in codegen can break the kernel tests.
Test Plan: CI
Differential Revision: D43777848
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96084
Approved by: https://github.com/larryliu0820
Fix `Gumbel.cdf` function.
**Description**
Handle the case where the transformed parameter is outside the support of the underlying Uniform distribution. This makes the behavior of `Gumbel.cdf` consistent with other `TransformedDistribution`s that pass the value of validate_args to the base distribution.
**Issue**
Running `Gumbel(0.0,1.0,validate_args=False).cdf(20.0)` would raise a `ValueError` exception from `_validate_sample`.
**Testing**
A test was added to `test_distributions.py` to check that `Gumbel(0.0,1.0,validate_args=False).cdf(20.0)` successfully returns `1.0`.
This is a second attempt to push these changes, after https://github.com/pytorch/pytorch/pull/82488
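A small usage sketch of the fixed behavior:
```python
import torch
from torch.distributions import Gumbel

d = Gumbel(0.0, 1.0, validate_args=False)
# 20.0 maps outside the support of the underlying Uniform; with validate_args=False
# this now returns ~1.0 instead of raising a ValueError from _validate_sample.
print(d.cdf(torch.tensor(20.0)))
```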
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91698
Approved by: https://github.com/fritzo, https://github.com/zou3519
Summary: Makes the debug dir location configurable with TORCH_COMPILE_DEBUG_DIR env var
Test Plan: TORCH_COMPILE_DEBUG_DIR="." buck2 run mode/dev-nosan //caffe2/test/inductor:minifier_smoke
Reviewed By: bertmaher
Differential Revision: D43639955
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96089
Approved by: https://github.com/bertmaher
**Summary:**
Currently the only way to destroy a process group is calling `dist.destroy_process_group`. However, this API does not guarantee destruction of the ProcessGroup object since it only deletes references inside `distributed_c10d.py`. In cases where the process group is used in multiple places it is not feasible to hunt down all the references and delete them.
In particular for NCCL if a collective gets stuck the only way to recover from this is calling ncclCommAbort on all the communicators. Currently there is no API to achieve this.
To address this, in this PR I've added an `_abort` method to ProcessGroupNCCL, so we now have a guaranteed way to kill all NCCL communicators associated with a ProcessGroup.
**Test Plan:**
Added a unit test to validate this works as expected
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96017
Approved by: https://github.com/wanchaol
Adds a profiler start and end callback to dynamo's C eval_frame impl, which can be used to profile a region, providing a name for visualization. Currently only one usage is hooked up, to profile the cache lookup (primarily covering guards and the linear search through the linked list).
Example profile taken from toy model:
`python benchmarks/dynamo/distributed.py --toy_model --profile --dynamo aot_eager`
<img width="1342" alt="image" src="https://user-images.githubusercontent.com/4984825/223225931-b2f6c5a7-505a-4c90-9a03-34982f6dc033.png">
Planning to measure overhead in CI, and probably can't afford to check this in enabled by default. Will have to evaluate UX options such as `config.profile_dynamo_cache = True` or some other way.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96119
Approved by: https://github.com/jansel
For ```torch.baddbmm(input, mat1, mat2, beta=0)```, if ```beta``` is zero, the multiplication ```input*beta``` is ignored in eager mode (it always yields zero, see https://pytorch.org/docs/stable/generated/torch.baddbmm.html?highlight=torch+baddbmm#torch.baddbmm), but not in the inductor: the inductor will get a different value if the input has a ```nan``` or ```inf``` value:
```
def fn_test(input, mat1, mat2):
return torch.baddbmm(input, mat1, mat2, beta=0.0)
opt_fn = torch._dynamo.optimize("inductor")(fn_test)
a, b, c = [torch.rand((3,2,2)) for _ in range(3)]
real_out = fn_test(a, b, c)
a[:] = torch.nan
compiled_out = opt_fn(a, b,c)
print(compiled_out)
print(real_out)
```
before this PR, the output will be like this:
```
tensor([[[0.4272, 0.6037],
[0.4279, 0.4219]],
[[0.0838, 0.4873],
[0.1210, 0.5516]],
[[ nan, nan],
[ nan, nan]]])
tensor([[[0.4272, 0.6037],
[0.4279, 0.4219]],
[[0.0838, 0.4873],
[0.1210, 0.5516]],
[[0.4985, 0.1072],
[0.0857, 0.0186]]])
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96087
Approved by: https://github.com/jansel, https://github.com/ngimel, https://github.com/jgong5
Fixes https://github.com/pytorch/pytorch/issues/91483
Using a separate test class here, so that there is no need to run setup and teardown for all tests in TestJit. The root cause is that test_profiler could be flaky and fail in the middle without the chance to restore `torch._C._set_graph_executor_optimize` to its original value (https://github.com/pytorch/pytorch/issues/81626). This causes issues for all future tests running after it, as shown in https://github.com/pytorch/pytorch/issues/91483.
I suspect this is also the root cause for several other flaky tests in the same file https://github.com/search?q=repo%3Apytorch%2Fpytorch+DISABLED+test_jit.TestScript&type=issues. After this fix is merged, I would let the retry bot do its job and close these issues after 2 weeks.
### Testing
The issue https://github.com/pytorch/pytorch/issues/91483 can now be reproduced by adding `torch._C._set_graph_executor_optimize(False)` locally to see if the test fails:
```
diff --git a/test/test_jit.py b/test/test_jit.py
index 2d1161d7466..17745d39182 100644
--- a/test/test_jit.py
+++ b/test/test_jit.py
@@ -5413,6 +5413,8 @@ a")
FileCheck().check("int =").check("ListConstruct").check("aten::cat").run(str(g))
def test_stack(self):
+ torch._C._set_graph_executor_optimize(False)
+
with enable_profiling_mode_for_profiling_tests():
@torch.jit.script
def func(x):
```
It indeed fails:
```
======================================================================
FAIL [0.006s]: test_stack (test_jit.TestScript)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/var/lib/jenkins/workspace/test/test_jit.py", line 5437, in test_stack
self.assertAutodiffNode(func2.graph_for(x, y), True, ['aten::stack'], [])
File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/common_jit.py", line 282, in assertAutodiffNode
self.assertEqual(should_autodiff_node,
File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 2975, in assertEqual
raise error_metas[0].to_error(
AssertionError: Booleans mismatch: True is not False
Failure in testing nodes' autodifferentiation. One or more nodes were expected to be autodiffed, but were not found in specified fusible/nonfusible DifferentiableGraph groups.
Specifically:
['aten::stack'] were not in one of the DifferentiableGraphs when they were expected to be. Did you intend for these nodes to be autodiffed? If not, remove them from the list of nonfusible nodes.
----------------------------------------------------------------------
Ran 2677 tests in 84.596s
FAILED (failures=1, skipped=136, expected failures=13)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96135
Approved by: https://github.com/clee2000
Fixes a crash while running something like `python -c "import torch;x=torch.rand(3, 3, dtype=torch.float16, device='mps');y=x.addcmul(torch.ones(3, device='mps'), torch.ones(3, device='mps'));print(y)"`
Modify `castMPSTensor` to become a no-op if the cast is not needed.
Define `common_dtype` as the `c10::promoteTypes` result of self, tensor1 and tensor2, and cast to the output type as needed.
Add a mixed-types test to `TestMPS.test_addcmul`, though it does not cover all the permutations.
Discovered while looking at https://github.com/pytorch/pytorch/issues/96113
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96164
Approved by: https://github.com/kulinseth
I have a minor tweak on the uploading workflow to upload to S3 first before Rockset, as `upload-test-stats` and `upload-torch-dynamo-perf-stats` both run when inductor-A100-perf finishes. There is a potential race condition where the test reports are not yet on S3 when `upload-torch-dynamo-perf-stats` runs (it gets the data from S3). `inductor-A100-perf` is now handled exclusively by `upload-torch-dynamo-perf-stats` to avoid this. It will upload test reports to S3 first before pushing them to Rockset.
The uploading script works fine with the test reports from https://hud.pytorch.org/pr/95685.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96165
Approved by: https://github.com/desertfire
- port https://github.com/intel-innersource/frameworks.ai.pytorch.ipex-cpu/pull/740 to `run_cpu`
- use-case by https://github.com/pytorch/serve/pull/2166 where `numactl` is unavailable (e.g., requires `privileged` mode)
This PR automatically tries taskset if numactl core binding doesn't work.
Reference:
`taskset` is added to adapt to launcher use-cases such as docker, where `numactl` requires `privileged` mode, and the `privileged` mode "wont work for deployments like sagemaker for example" as raised by TorchServe. Please see the [torchserve ipex docker discussion](https://github.com/pytorch/serve/pull/1401#issuecomment-1090817704) for reference. To address such use-cases, `taskset` can be used in place of `numactl` to set core affinity. Note that, unlike `numactl`, `taskset` does not provide memory binding to local memories; however, memory binding may not be needed in these use-cases, which typically do not span multiple sockets. Hence we can automatically try taskset if numactl doesn't work, as sketched below.
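A minimal, Linux-only Python sketch of the fallback idea (not the launcher's actual code): bind the current process to a set of cores without `numactl`. Unlike `numactl`, this only sets CPU affinity; it does not bind memory to a local NUMA node.
```python
import os

cores = {0, 1, 2, 3}               # illustrative core ids
os.sched_setaffinity(0, cores)     # pid 0 means "the calling process"
print(os.sched_getaffinity(0))     # verify the new affinity mask
```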
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96011
Approved by: https://github.com/jgong5, https://github.com/malfet
Summary:
This diff is reverting D43643526
Depends on D43693521
D43643526: Avoid copies in matmul (#76828) by generatedunixname499836121 has been identified to be causing the following test or build failures:
Tests affected:
- [mle/favour:tests - favour_test.py::TestLinears::test_psd](https://www.internalfb.com/intern/test/562950027104300/)
Here's the Multisect link:
https://www.internalfb.com/intern/testinfra/multisect/1611690
Here are the tasks that are relevant to this breakage:
T146911536: 5 tests started failing for oncall prob in the last 2 weeks
We're generating a revert to back out the changes in this diff, please note the backout may land if someone accepts it.
Test Plan: NA
Differential Revision: D43693526
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96126
Approved by: https://github.com/weiwangmeta
This is a follow up for PR #95506 to run all the triton kernels in a compiled module individually as suggested by Horace.
Here are the steps:
1. Run the model as usual with a benchmark script and with TORCHINDUCTOR_BENCHMARK_KERNEL enabled. e.g.
```
TORCHINDUCTOR_BENCHMARK_KERNEL=1 python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --dashboard --only resnet18 --disable-cudagraphs --training
```
2. From the output we will see 3 lines like
```
Compiled module path: /tmp/torchinductor_shunting/rs/crsuc6zrt3y6lktz33jjqgpkuahya56xj6sentyiz7iv4pjud43j.py
```
That's because we have one graph module each for fwd/bwd/optimizer. Each graph module will have one such output line corresponding to its compiled module.
3. We can run the compiled module directly. Without any extra arguments, we just maintain the previous behavior of running the call function -- which just does what the original graph module does but in a more efficient way. But if we add the '-k' argument, we will run a benchmark for each individual kernel in the file.
```
python /tmp/torchinductor_shunting/rs/crsuc6zrt3y6lktz33jjqgpkuahya56xj6sentyiz7iv4pjud43j.py -k
```
Example output:
<img width="430" alt="Screenshot 2023-03-01 at 4 51 06 PM" src="https://user-images.githubusercontent.com/52589240/222302996-814a85be-472b-463c-9e85-39d2c9d20e1a.png">
Note: I use the first 10 characters of the hash to identify each kernel since
1. the hash is easier to get in the code :)
2. a name like `triton__3` only makes sense within a compiled module, but a hash can make sense even without specifying the compiled module (assuming we have enough bytes for the hash)
If we find a triton kernel with a hash like c226iuf2wi having poor performance, we can look it up in the original compiled module file. This works since we comment each compiled triton kernel with the full hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95845
Approved by: https://github.com/Chillee
Without this, when a frame is skipped because of a graph break, at the INFO logging level all you see is:
```
[INFO] Step 1: torchdynamo start tracing run_n_iterations
[INFO] Step 1: torchdynamo start tracing forward_pass
```
With this promotion, you now see:
```
[INFO] Step 1: torchdynamo start tracing run_n_iterations
[INFO] Skipping frame because there is a graph break in a for/while loop
[INFO] Step 1: torchdynamo start tracing forward_pass
```
This is MUCH more useful, while only adding a single log message per
already existing INFO log message.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95968
Approved by: https://github.com/albanD, https://github.com/janeyx99
From https://github.com/pytorch/pytorch/pull/95938 where a new Docker image build fails to start sccache. This issue started happening today (Mar 3rd). The server fails to start with a cryptic `sccache: error: Invalid argument (os error 22)`
```
=================== sccache compilation log ===================
ERROR 2023-03-03T20:31:14Z: sccache::server: failed to start server: Invalid argument (os error 22)
sccache: error: Invalid argument (os error 22)
=========== If your build fails, please take a look at the log above for possible reasons ===========
+ sccache --show-stats
sccache: error: Connection to server timed out
```
I don't have a good explanation for this yet. The version of sccache we build from https://github.com/pytorch/sccache is ancient. If I build the exact same version on the Ubuntu Docker image now, the issue manifests. But an older binary built only a few days ago (e50ff3fcdb) works without any issue. So I pin the sccache binary to that version instead of rebuilding it every time in the image, as a temporary mitigation while trying to root-cause this further.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95997
Approved by: https://github.com/ZainRizvi
_functional_collectives.py: Ensure we always wait all collectives.
derivatives.yaml: mark all_reduce as non differentiable
gen_variable_type.py: Add all_reduce to DONT_ENFORCE_TENSOR_IMPL_USE_COUNT
common_dtensor.py: replace dist.barrier with all_reduce
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95897
Approved by: https://github.com/wconstab, https://github.com/fegin
Given the following case:
```
import torch
import torch._dynamo
class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.conv1 = torch.nn.Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1))
        self.conv2 = torch.nn.Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1))
        self.silu = torch.nn.SiLU(inplace=False)

    def forward(self, x):
        x = self.silu(x)
        y1 = self.conv1(x)
        y2 = self.conv2(x)
        return y1, y2

model = Model().eval()
model = model.to(memory_format=torch.channels_last).eval()
opt_model = torch._dynamo.optimize('inductor')(model)

x = torch.randn(128, 64, 112, 112).to(memory_format=torch.channels_last)
with torch.no_grad():
    for i in range(3):
        out = opt_model(x)
```
the silu output is used by two external kernels, and there is always a redundant memory copy:
```
kernel_cpp_0 = async_compile.cpp('''
#include "/tmp/torchinductor_xiaobing/dl/cdljpywww2h2ag4o35mwbvm45hhasxnxkhqgbupxnk3y7olula65.h"
extern "C" void kernel(const float* __restrict__ in_ptr0,
                       float* __restrict__ out_ptr0,
                       float* __restrict__ out_ptr1)
{
    #pragma omp parallel num_threads(40)
    {
        {
            #pragma omp for
            for(long i0=0; i0<6422528; i0+=1)
            {
                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + 16*i0);
                auto tmp1 = decltype(tmp0)(1)/(decltype(tmp0)(1) + tmp0.neg().exp());
                auto tmp2 = tmp0 * tmp1;
                tmp2.store(out_ptr0 + 16*i0);
                tmp2.store(out_ptr1 + 16*i0);
            }
            #pragma omp for simd simdlen(8)
            for(long i0=102760448; i0<102760448; i0+=1)
            {
                auto tmp0 = in_ptr0[i0];
                auto tmp1 = decltype(tmp0)(1) / (decltype(tmp0)(1) + std::exp(-tmp0));
                auto tmp2 = tmp0 * tmp1;
                out_ptr0[i0] = tmp2;
                out_ptr1[i0] = tmp2;
            }
        }
    }
}
''')
```
This PR pre-converts the `silu` output's layout to FixedLayout on the FX side (it will be realized there to avoid multiple realizations at the external kernels) if one of its users is a CPU external custom kernel. After this PR, the output code is:
```
kernel_cpp_0 = async_compile.cpp('''
#include "/tmp/torchinductor_xiaobing/dl/cdljpywww2h2ag4o35mwbvm45hhasxnxkhqgbupxnk3y7olula65.h"
extern "C" void kernel(const float* __restrict__ in_ptr0,
                       float* __restrict__ out_ptr0)
{
    #pragma omp parallel num_threads(40)
    {
        {
            #pragma omp for
            for(long i0=0; i0<6422528; i0+=1)
            {
                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + 16*i0);
                auto tmp1 = decltype(tmp0)(1)/(decltype(tmp0)(1) + tmp0.neg().exp());
                auto tmp2 = tmp0 * tmp1;
                tmp2.store(out_ptr0 + 16*i0);
            }
            #pragma omp for simd simdlen(8)
            for(long i0=102760448; i0<102760448; i0+=1)
            {
                auto tmp0 = in_ptr0[i0];
                auto tmp1 = decltype(tmp0)(1) / (decltype(tmp0)(1) + std::exp(-tmp0));
                auto tmp2 = tmp0 * tmp1;
                out_ptr0[i0] = tmp2;
            }
        }
    }
}
''')
```
Currently, this PR only considers CPU external custom kernels, but other external kernels may have the same issue.
For Timm **eca_halonext26ts**, this PR gives about an **8%** performance improvement (BS=128, 20 cores on SKX).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95873
Approved by: https://github.com/jansel
Fixes https://github.com/pytorch/pytorch/issues/93485
```python
import torch
from torchvision.models import resnet50
model = resnet50(weights=None)
compile_model = torch.compile(model)
print(type(compile_model))
example_forward_input = torch.rand(1, 3, 224, 224)
c_model_traced = torch.jit.trace(compile_model, example_forward_input) # or torch.jit.script
torch.jit.save(c_model_traced, "c_trace_model.pt")
```
Should I raise a warning if a user tries to compile a scripted or traced model as well? It works just fine now on resnet, but I'm not sure if that's something we want to explicitly discourage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91681
Approved by: https://github.com/desertfire
Add a doc test, extending #95534.
I found I need to put the xdoctest under a class method. Otherwise, if it's right under the class definition, the test cannot be found. @Erotemic Am I missing anything?
The xdoctest has been tested:
```
$ pytest --xdoctest torch/fx/passes/graph_drawer.py::FxGraphDrawer.get_dot_graph:0
=========== test session starts ==================
platform linux -- Python 3.9.15, pytest-7.2.1, pluggy-1.0.0
rootdir: /localdisk/wenzhexu/dev/forked_pytorch, configfile: pytest.ini
plugins: xdoctest-1.1.1
collected 1 item
torch/fx/passes/graph_drawer.py . [100%]
============ 1 passed in 1.13s ===================
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95919
Approved by: https://github.com/ezyang
Summary: fix the src and pad mask bool regression
This fixes a regression introduced with #92733. That PR unified mask testing to remove Byte tensors as a permissible mask, introduced a mask compatibility check, and added mask conversion to FP masks. The problem addressed in this PR is that after the first mask had been converted, the check for mask compatibility would fail.
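A hedged sketch of the regressed scenario (layer sizes are illustrative, not taken from the original report): passing both a boolean attention mask and a boolean key-padding mask, where converting the first mask to a float mask used to break the compatibility check for the second.
```python
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=16, nhead=2, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=1)

src = torch.randn(2, 5, 16)
attn_mask = torch.zeros(5, 5, dtype=torch.bool)      # bool attention mask
padding_mask = torch.zeros(2, 5, dtype=torch.bool)   # bool key-padding mask

# With the fix, both bool masks are accepted together.
out = encoder(src, mask=attn_mask, src_key_padding_mask=padding_mask)
print(out.shape)  # torch.Size([2, 5, 16])
```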
Test Plan: sandcastle & github
Differential Revision: D43782858
Fixes https://github.com/pytorch/pytorch/issues/95702
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96009
Approved by: https://github.com/malfet
follow-up to https://github.com/pytorch/pytorch/pull/93901.
Unexpected numerical mismatches observed in some foreach functions' backward results seemed to be caused by the wrong order of the `IndexRangeGenerator::range` calls.
This PR makes `args_with_derivatives` follow the same (or a similar) order as `foreach_native_function.func.arguments.flat_non_out`.
---
what the current master generates for `_foreach_mul.List`:
```cpp
variable_list ForeachMulBackward0List::apply(variable_list&& grads) {
  std::lock_guard<std::mutex> lock(mutex_);
  TORCH_CHECK(!other_released_, ERR_BACKWARD_TWICE);
  TORCH_CHECK(!self_released_, ERR_BACKWARD_TWICE);
  IndexRangeGenerator gen;
  auto other_ix = gen.range(other_size_);
  auto self_ix = gen.range(self_size_);
  variable_list grad_inputs(gen.size());
  auto other = unpack_list(other_);
  auto self = unpack_list(self_);
  if (task_should_compute_output({ other_ix })) {
    std::vector<Tensor> grad_result;
    grad_result.reserve(grads.size());
    for (const auto & i : c10::irange(grads.size())) {
      grad_result.emplace_back(mul_tensor_backward(grads[i], self[i], other[i].scalar_type()));
    }
    copy_range(grad_inputs, other_ix, grad_result);
  }
  if (task_should_compute_output({ self_ix })) {
    std::vector<Tensor> grad_result;
    grad_result.reserve(grads.size());
    for (const auto & i : c10::irange(grads.size())) {
      grad_result.emplace_back(mul_tensor_backward(grads[i], other[i], self[i].scalar_type()));
    }
    copy_range(grad_inputs, self_ix, grad_result);
  }
  return grad_inputs;
}
```
with this PR the generated backward is
```cpp
variable_list ForeachMulBackward0List::apply(variable_list&& grads) {
  std::lock_guard<std::mutex> lock(mutex_);
  TORCH_CHECK(!self_released_, ERR_BACKWARD_TWICE);
  TORCH_CHECK(!other_released_, ERR_BACKWARD_TWICE);
  IndexRangeGenerator gen;
  auto self_ix = gen.range(self_size_);    <----- diff
  auto other_ix = gen.range(other_size_);  <----- diff
  variable_list grad_inputs(gen.size());
  auto self = unpack_list(self_);
  auto other = unpack_list(other_);
  if (task_should_compute_output({ other_ix })) {
    std::vector<Tensor> grad_result;
    grad_result.reserve(grads.size());
    for (const auto & i : c10::irange(grads.size())) {
      grad_result.emplace_back(mul_tensor_backward(grads[i], self[i], other[i].scalar_type()));
    }
    copy_range(grad_inputs, other_ix, grad_result);
  }
  if (task_should_compute_output({ self_ix })) {
    std::vector<Tensor> grad_result;
    grad_result.reserve(grads.size());
    for (const auto & i : c10::irange(grads.size())) {
      grad_result.emplace_back(mul_tensor_backward(grads[i], other[i], self[i].scalar_type()));
    }
    copy_range(grad_inputs, self_ix, grad_result);
  }
  return grad_inputs;
}
```
The change is to fix the order of `self_ix` and `other_ix`.
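A hedged numerical check of the property this fixes (assuming `torch._foreach_mul` is differentiable in the build being tested): foreach gradients should match the per-tensor `mul` gradients.
```python
import torch

xs = [torch.randn(3, requires_grad=True) for _ in range(2)]
ys = [torch.randn(3, requires_grad=True) for _ in range(2)]

outs = torch._foreach_mul(xs, ys)
sum(o.sum() for o in outs).backward()

for x, y in zip(xs, ys):
    # d(x*y)/dx = y and d(x*y)/dy = x
    torch.testing.assert_close(x.grad, y.detach())
    torch.testing.assert_close(y.grad, x.detach())
```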
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95263
Approved by: https://github.com/soulitzer
This PR optimizes the guards overhead introduced by dynamo tracing module forward hooks.
It can and maybe should be followed by a wider change proposed by @voznesenskym to optimize specialized nnmodules by 'observing' any user mutations and directly invalidating the root guard, obviating the need to install other nnmodule guards. (But this observer change seems more involved...)
Idea: maintain a flag, and keep it up to date whenever adding or removing hooks. Use the flag rather than dict checks to enter the call fast path (see the sketch after the list below).
- need to extend RemovableHandle to keep a ref to the nnModule so it can update the flag on removal.
- also need to handle the flag in ScriptModule, which still uses the python call impl when called from python.
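A rough, self-contained sketch of the pattern (hypothetical class and attribute names, not the actual `nn.Module`/`RemovableHandle` code): keep a boolean flag in sync with the hook dict so the call fast path only checks the flag.
```python
class _Handle:
    """RemovableHandle-like object that keeps the owner's flag up to date."""
    def __init__(self, module, key):
        self.module, self.key = module, key

    def remove(self):
        self.module._forward_hooks.pop(self.key, None)
        self.module._has_hooks = bool(self.module._forward_hooks)

class HookAwareModule:
    def __init__(self):
        self._forward_hooks = {}
        self._has_hooks = False            # flag consulted on the fast path

    def register_forward_hook(self, fn):
        key = id(fn)
        self._forward_hooks[key] = fn
        self._has_hooks = True             # flag updated on registration
        return _Handle(self, key)

    def __call__(self, *args):
        if not self._has_hooks:            # fast path: no dict lookups
            return self.forward(*args)
        out = self.forward(*args)
        for hook in self._forward_hooks.values():
            hook(self, args, out)
        return out

    def forward(self, *args):              # to be overridden by subclasses
        raise NotImplementedError
```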
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95931
Approved by: https://github.com/ezyang, https://github.com/voznesenskym
Summary:
This is a retry of https://github.com/pytorch/pytorch/pull/94992 which was reverted due to CI issues.
This PR adds a set of uninterpreted data types to PyTorch which can be used to implement experimental functionality out of core (think fp8, int4, int16 quant, etc).
@bypass-github-export-checks
Test Plan:
```
python test/test_quantization.py -k TestBits
```
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95860
Approved by: https://github.com/atalman
I.e. attempt to create a tensor of each possible dtype and make sure that it raises a structured error for non-MPS types.
Also, rename `test_resize_as_all_dtypes_and_devices` to `test_resize_as_mps_dtypes` and `test_resize_all_dtypes_and_devices` to `test_resize_mps_dtypes`, and run both tests for all MPS dtypes (rather than just bool, float16 and bfloat16 as they were running before).
Fixes https://github.com/pytorch/pytorch/issues/95976
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95982
Approved by: https://github.com/kulinseth
OK, so this PR used to be about reducing the number of constants we specialize on, but it turns out that unspecialization was ~essentially never used (because we still constant specialized way too aggressively) and I ended up having to fix a bunch of issues to actually get tests to pass. So this PR is now "make int unspecialization actually work". As part of this, I have to turn off unspecialization by default, as there are still latent bugs in inductor.
The general strategy is that an unspecialized int is represented as a SymInt. Representing it as a 0d tensor (which is what the code used to do) is untenable: (1) we often need unspecialized ints to participate in size computations, but we have no way of propagating sympy expressions through tensor compute, and (2) a lot of APIs work when passed SymInt, but not when passed a Tensor. However, I continue to represent Numpy scalars as Tensors, as they are rarely used for size computation and they have an explicit dtype, so they are more accurately modeled as 0d tensors.
* I folded in the changes from https://github.com/pytorch/pytorch/pull/95099 as I cannot represent unspecialized ints as SymInts without also turning on dynamic shapes. This also eliminates the necessity for test_unspec.py, as toggling specialization without dynamic shapes doesn't do anything. As dynamic shapes defaults to unspecializing, I just deleted this entirely; for the specialization case, I rely on regular static shape tests to catch it. (Hypothetically, we could also rerun all the tests with dynamic shapes, but WITH int/float specialization, but this seems... not that useful? I mean, I guess export wants it, but I'd kind of like our Source heuristic to improve enough that export doesn't have to toggle this either.)
* Only 0/1 integers get specialized by default now
* A hodgepodge of fixes. I'll comment on the PR about them.
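As a rough, hedged sketch of the user-visible intent (not code from this PR; the exact behavior depends on the dynamic-shapes configuration): an integer argument other than 0/1 is traced symbolically as a SymInt instead of being burned into the graph as a constant.
```python
import torch

@torch.compile(dynamic=True)
def scale(x, n):
    # `n` participates in the traced graph symbolically rather than as the
    # literal constant 2 or 3.
    return x * n

scale(torch.randn(4), 2)
scale(torch.randn(4), 3)  # ideally served without a fresh recompile
```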
Fixes https://github.com/pytorch/pytorch/issues/95469
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95621
Approved by: https://github.com/jansel, https://github.com/Chillee
Fixes #95781.
The cause seems to be that the current implementation doesn't correctly pass `found_inf` when `grad_scale` is `None`. Therefore parameters can get mistakenly updated by gradients some of whose elements are invalid, i.e. nan or inf.
Related: #94060
I forgot about this wrong handling after #94344.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95847
Approved by: https://github.com/janeyx99
Finding out what the inductor configs mean has been a confusing point for the community, so this adds some top-level functions that print them out to the console for people who don't want to dig through the source code.
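A hedged sketch of what such a helper boils down to (the actual function names added by this PR are not shown here): iterate over the inductor config module and print each knob with its current value.
```python
import types
import torch._inductor.config as inductor_config

for name in sorted(dir(inductor_config)):
    if name.startswith("_"):
        continue
    value = getattr(inductor_config, name)
    # Skip imported modules and helper callables; only print plain knobs.
    if isinstance(value, types.ModuleType) or callable(value):
        continue
    print(f"{name} = {value!r}")
```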
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95824
Approved by: https://github.com/jansel
Summary:
Implement a zeros function inside the DTensor API.
- the user specifies the shape of the zeros tensor, and the function creates the local zero tensor given the placement information
Test Plan:
- unit test for the util function compute_local_tensor_size
- unit test for _tensor.zeros
Reviewed By: wanchaol
Differential Revision: D43630718
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95863
Approved by: https://github.com/wanchaol
This implements all reduce ops in all_reduce, and supports a PG being used from a thread different from the one that created it.
We should be this >< close to getting complex training tests working.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95524
Approved by: https://github.com/H-Huang
Run more tests through pytest.
Use a block list for tests that shouldn't run through pytest. As far as I can tell, the number of tests run, skipped, and xfailed for those not on the blocklist is the same.
Regarding the main module:
Usually when tests are run in CI, we call `python <test file>`, which causes the file to be imported under the module name `__main__`. However, pytest searches for the module to be imported under the file name, so the file will be reimported. This can cause issues for tests that run module-level code and change global state, like test_nn, which modifies lists imported from another file, or tests in test/lazy, which initialize a backend that cannot coexist with a second copy of itself.
My workaround for this is to run tests from the `__main__` module. However, this results in pytest being unable to rewrite assertions (and possibly other things but I don't know what other things pytest does right now). A better solution might be to call `pytest <test file>` directly and move all the code in run_tests(argv) to be module level code or put it in a hook in conftest.py.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95844
Approved by: https://github.com/huydhn
**Summary**
Linear is decomposed to `t - addmm/mm` after `dynamo.export`. And weight's observer is inserted between `t` and `addmm/mm` in the first place. `_rearrange_weight_observer_for_addmm()` is then called to move the observer between weight and `t`.
```
before:
weight - t - observer \
input - observer - addmm/mm
after:
weight - observer - t \
input - observer - addmm/mm
```
We found two issues of `_rearrange_weight_observer_for_addmm()`:
- It does not call `m.recompile()` in the end, so it does not function correctly.
- It does not support `aten.mm.default` which is from decomposed linear without bias.
This PR fixes the two issues and renames the function to `_rearrange_weight_observer_for_decomposed_linear`.
**Test plan**
python test/test_quantization.py -k test_rearrange_weight_observer_for_decomposed_linear
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94296
Approved by: https://github.com/jgong5, https://github.com/andrewor14
Summary: The AOT mode currently works for the CPP backend. When turned on, Inductor compiles the model code into a .so file with aot_inductor_entry as the entry function. If the AOT compilation fails, Inductor will explicitly fail.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94822
Approved by: https://github.com/jansel
Inductor implementations of collectives/wait must match
eager impls in _functional_collectives in terms of interacting
with _register_tensor_work API. If they do, then splitting
a collective-wait pair so one half is in a compiled graph should
work fine.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95893
Approved by: https://github.com/kumpera
These warnings are disabled to avoid long logs in Windows tests. They are also currently disabled in the CMake builds.
'/wd4624': MSVC complains "destructor was implicitly defined as delete" on c10::optional and other templates
'/wd4076': "unexpected tokens following preprocessor directive - expected a newline" on some header
'/wd4068': "The compiler ignored an unrecognized [pragma]"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95933
Approved by: https://github.com/ezyang
Changes:
1. Use class inheritance for `torch/return_types.pyi`:
Before:
```python
max = NamedTuple("max", [("values", Tensor), ("indices", Tensor)])
```
After:
```python
class max(NamedTuple):
values: Tensor
indices: Tensor
```
------
2. Add missing spaces in generated type annotations.
1. Always add a space after `,`.
2. If an argument is annotated, then there need to be spaces around `=` when it has a default value.
```diff
- def func(..., out: Optional[Tensor]=None, ...) -> Tensor:
+ def func(..., out: Optional[Tensor] = None, ...) -> Tensor:
```
3. If an argument is not annotated, then there should be no spaces around `=` when it has a default value.
```python
def contiguous(self, memory_format=torch.contiguous_format) -> Tensor: ...
```
------
3. ~Remove redundant import alias in `torch/nn/functional.pyi`:~ (Reverted)
UPDATE: `mypy` needs the alias to work.
Before:
```python
from .. import conv1d as conv1d
from .. import conv2d as conv2d
from .. import conv3d as conv3d
from .. import conv_transpose1d as conv_transpose1d
from .. import conv_transpose2d as conv_transpose2d
from .. import conv_transpose3d as conv_transpose3d
from .. import conv_tbc as conv_tbc
from .. import avg_pool1d as avg_pool1d
from .. import relu_ as relu_
from .. import selu_ as selu_
from .. import celu_ as celu_
from .. import rrelu_ as rrelu_
from .. import pixel_shuffle as pixel_shuffle
from .. import pixel_unshuffle as pixel_unshuffle
from .. import channel_shuffle as channel_shuffle
from .. import native_channel_shuffle as native_channel_shuffle
from .. import pdist as pdist
from .. import cosine_similarity as cosine_similarity
```
After:
```python
from .. import (
conv1d,
conv2d,
conv3d,
conv_transpose1d,
conv_transpose2d,
conv_transpose3d,
conv_tbc,
avg_pool1d,
relu_,
selu_,
celu_,
rrelu_,
pixel_shuffle,
pixel_unshuffle,
channel_shuffle,
native_channel_shuffle,
pdist,
cosine_similarity,
)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95877
Approved by: https://github.com/ezyang
Overall, here is an example usage. Note that this *also* captures backward FLOPs.
```
import torchvision.models as models
import torch
from torch.utils.flop_counter import FlopCounterMode
inp = torch.randn(1, 3, 224, 224, device='cpu')
mod = models.resnet18()
flop_counter = FlopCounterMode(mod, depth=1)
with flop_counter:
    mod(inp).sum().backward()
```
<img width="326" alt="image" src="https://user-images.githubusercontent.com/6355099/222023068-3491e405-f195-4e11-b679-36b19a1380c7.png">
You can control the depth of the module hierarchy with the `depth` attribute (which defaults to 2). For example, if I don't limit it, this is what it outputs.
<img width="366" alt="image" src="https://user-images.githubusercontent.com/6355099/222023306-3d880bb6-f534-4f98-bf10-83c4353acefc.png">
## Other APIs
FlopCounterMode(custom_mapping=...): Allows for custom flop counting functions
FlopCounterMode.get_table(depth=...): Explicitly get the table as a string
FlopCounterMode.flop_counts: Contains the flop information as a Dict[hierarchy: str, Dict[Op, int]]
FlopCounterMode.register_hierarchy(f, name): Allows you to register additional "hierarchies" for a function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95751
Approved by: https://github.com/ngimel, https://github.com/albanD
Fixes for PyTorch/XLA functionalization integration
---
Some notable changes include:
- More asserts in `FunctionalTensorWrapper`, so bugs show up more cleanly in cases where we e.g. forget to wrap an output
- Make the *_scatter ops `CompositeExplicitAutogradNonFunctional`, so we get a better error message and XLA doesn't accidentally try to us them
- Fix LTC/XLA codegen in core to handle multi-tensor out= ops with no returns
- Better erroring: Allow XLA to use the CPU fallback from core in a way so that it always errors on view ops, which XLA should no longer see.
- Update MetaConverter to exclude XLA tensors in raising NotImplemented…
- Add `_propagate_xla_data` op
- Add meta tensor support for some ops
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94537
Approved by: https://github.com/bdhirsh
Context: We want to create a metric panel to track external contributions to the PyTorch repo.
This PR creates a daily job to track how many external contributions occurred the day before and uploads the results to an S3 collection which is accessible by Rockset.
`upload_external_contrib_stats.py` is a python script which grabs the necessary stats from GitHub and sticks them into an S3 bucket. It is used here to do daily uploads, but can generally be used for larger queries as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95747
Approved by: https://github.com/huydhn, https://github.com/kit1980
This is relanding the troubling part of #95009 that caused a regression.
BC: This changes the signature and semantics of DeviceMesh::all_reduce.
DeviceMesh::all_reduce now uses a functional collective under the hood which makes it more easily traceable.
You no longer need to use CommTensor to get a trace.
all_reduce now is async only and uses AsyncCollectiveTensor to ensure proper stream synchronization.
Signature changed: removed async_op param and changes return type from Optional[Work] to torch.Tensor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95804
Approved by: https://github.com/fegin
Fixes #95794
This is a hotfix for the decomposition only (which is currently used by inductor); the reference still accesses invalid indices. Perhaps `_nll_loss_nd` and this decomp should be unified. cc @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @desertfire @lezcano
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95833
Approved by: https://github.com/lezcano, https://github.com/Chillee
Big OOP correction continued. Also added a test this time to verify the defaulting was as expected.
The key here is realizing that the grouping for foreach already assumes that the non-param tensorlists follow suit in dtype and device, so it is too narrow to check that _all_ tensors were on CUDA. The main leeway this allowed was state_steps, which are sometimes cpu tensors. Since foreach _can_ handle cpu tensors, this should not introduce breakage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95820
Approved by: https://github.com/albanD
MPS in macOS 13.3 has added support for int64 in reduction ops / cumsum / sort / argsort. This change removes the hard-coded casts and error messages used prior to macOS 13.3, allowing these ops to run natively with int64.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95817
Approved by: https://github.com/kulinseth
Changes:
- #95200
1. Recognize `.py.in` and `.pyi.in` files as Python in VS Code for a better development experience.
2. Fix deep setting merge in `tools/vscode_settings.py`.
- #95267
3. Use `NamedTuple` rather than `namedtuple + __annotations__` for `torch.nn.utils.rnn.PackedSequence_`:
`namedtuple + __annotations__`:
```python
PackedSequence_ = namedtuple('PackedSequence_',
['data', 'batch_sizes', 'sorted_indices', 'unsorted_indices'])
# type annotation for PackedSequence_ to make it compatible with TorchScript
PackedSequence_.__annotations__ = {'data': torch.Tensor, 'batch_sizes': torch.Tensor,
'sorted_indices': Optional[torch.Tensor],
'unsorted_indices': Optional[torch.Tensor]}
```
`NamedTuple`: Python 3.6+
```python
class PackedSequence_(NamedTuple):
data: torch.Tensor
batch_sizes: torch.Tensor
sorted_indices: Optional[torch.Tensor]
unsorted_indices: Optional[torch.Tensor]
```
- => this PR: #95268
4. Sort import statements and remove unnecessary imports in `.pyi`, `.pyi.in` files.
5. Format `.pyi`, `.pyi.in` files and remove unnecessary ellipsis `...` in type stubs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95268
Approved by: https://github.com/huydhn
This should be self-contained enough to merge, but other stuff that's been bugging me is:
* Instructions on debugging IMA issues
* Dynamic shape instructions
* Explaining config options better
Will look at adding a config options doc
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95802
Approved by: https://github.com/svekars
To align with upstream, we are requiring triton dependency to be between 2.0.0 and 2.1. This will allow PyTorch 2.0 on ROCM to stay flexible enough to pick up any performance/stability improvements from Triton, without needing to cut a separate PyTorch version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95793
Approved by: https://github.com/huydhn
Try to cancel workflow runs for previous commits to avoid wasted runs on older commits. Not sure if a different user's push would cancel an ongoing job.
Currently multiple commits from the same open PR would be running, even though most likely only the latest commit's status is of interest.
This tries to see if old workflow runs can get cancelled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95807
Approved by: https://github.com/huydhn
Changes:
- #95200
1. Recognize `.py.in` and `.pyi.in` files as Python in VS Code for a better development experience.
2. Fix deep setting merge in `tools/vscode_settings.py`.
- => this PR: #95267
3. Use `NamedTuple` rather than `namedtuple + __annotations__` for `torch.nn.utils.rnn.PackedSequence_`:
`namedtuple + __annotations__`:
```python
PackedSequence_ = namedtuple('PackedSequence_',
['data', 'batch_sizes', 'sorted_indices', 'unsorted_indices'])
# type annotation for PackedSequence_ to make it compatible with TorchScript
PackedSequence_.__annotations__ = {'data': torch.Tensor, 'batch_sizes': torch.Tensor,
'sorted_indices': Optional[torch.Tensor],
'unsorted_indices': Optional[torch.Tensor]}
```
`NamedTuple`: Python 3.6+
```python
class PackedSequence_(NamedTuple):
data: torch.Tensor
batch_sizes: torch.Tensor
sorted_indices: Optional[torch.Tensor]
unsorted_indices: Optional[torch.Tensor]
```
- #95268
4. Sort import statements and remove unnecessary imports in `.pyi`, `.pyi.in` files.
5. Format `.pyi`, `.pyi.in` files and remove unnecessary ellipsis `...` in type stubs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95267
Approved by: https://github.com/janeyx99
Changes:
- => this PR: #95200
1. Recognize `.py.in` and `.pyi.in` files as Python in VS Code for a better development experience.
2. Fix deep setting merge in `tools/vscode_settings.py`.
- #95267
3. Use `NamedTuple` rather than `namedtuple + __annotations__` for `torch.nn.utils.rnn.PackedSequence_`:
`namedtuple + __annotations__`:
```python
PackedSequence_ = namedtuple('PackedSequence_',
['data', 'batch_sizes', 'sorted_indices', 'unsorted_indices'])
# type annotation for PackedSequence_ to make it compatible with TorchScript
PackedSequence_.__annotations__ = {'data': torch.Tensor, 'batch_sizes': torch.Tensor,
'sorted_indices': Optional[torch.Tensor],
'unsorted_indices': Optional[torch.Tensor]}
```
`NamedTuple`: Python 3.6+
```python
class PackedSequence_(NamedTuple):
data: torch.Tensor
batch_sizes: torch.Tensor
sorted_indices: Optional[torch.Tensor]
unsorted_indices: Optional[torch.Tensor]
```
- #95268
4. Sort import statements and remove unnecessary imports in `.pyi`, `.pyi.in` files.
5. Format `.pyi`, `.pyi.in` files and remove unnecessary ellipsis `...` in type stubs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95200
Approved by: https://github.com/janeyx99
A PR to generate benchmark code for individual triton kernels. We can explore improving autotuning with the saved compiled kernel directly. This potentially can speedup our iteration and separate the concern with the upstream components that generate the compiled module.
Since I'm still ramping up on inductor, I'll reflect what I learned here so people can correct me if I'm wrong. In inductor, the WrapperCodeGen class is used to generate the compiled module for CUDA (or triton). Here is an example compiled module for a toy model like `def f(x): return sin(x) + cos(x)`: https://gist.github.com/shunting314/c6ed9f571919e3b414166f1696dcc61b. A compiled module contains the following parts:
- various triton kernels
- a wrapper (a method named `call`; the name is hardcoded) that calls the triton kernels and potentially ATen kernels to efficiently do the same work as the original Fx graph being compiled by inductor
- some utility code that generates random inputs and runs the wrapper
The triton kernels in the compiled module are annotated with decorators like `pointwise`, which are used for autotuning.
This PR adds a config which, when enabled, triggers the path of the compiled module being printed. It can be controlled from an environment variable as well.
The path to each compiled triton kernel is added as a comment in the compiled module. E.g.
```
# kernel path: /tmp/torchinductor_shunting/gn/cgn6x3mqoltu7q77gjnu2elwfupinsvcovqwibc6fhsoiy34tvga.py
triton__0 = async_compile.triton('''
import triton
import triton.language as tl
...
""")
````
Example command:
```
TORCHINDUCTOR_OUTPUT_COMPILED_MODULE_PATH=1 TORCHINDUCTOR_BENCHMARK_KERNEL=1 python benchmarks/dynamo/huggingface.py --backend inductor --amp --performance --training --dashboard --only AlbertForMaskedLM --disable-cudagraphs
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95506
Approved by: https://github.com/Chillee
Summary: Currently running PyTorch tests with dynamo and inductor is
controlled by environment variables, and CI sets them based on test
config name matching. Change them to use options of run_test.py.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94539
Approved by: https://github.com/huydhn
We have plenty of runners now, let's use them for compilation as well.
To achieve that, remove the `xcode-version: "13.3.1"` property and tweak the Metal framework detection logic to work with the command line tools (which are installed in `/Library/Developer/CommandLineTools`, with the SDK in `/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk`) rather than a full Xcode installation.
TODO: Fix/enable OpenMP accelerated native builds (which are currently broken with `OMP: Error #15: Initializing libomp.dylib, but found libomp.dylib already initialized.`), but this matches existing behavior as cross-builds are compiled with OpenMP disabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95719
Approved by: https://github.com/huydhn
This change will reduce the layer size as it will not save the layers; it will also build more cleanly on other machines as it won't ask for user interaction when running the build.
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95375
Approved by: https://github.com/ezyang
This generates compilable code for maskrcnn graph 13, with ceilings hoisted to be computed on the host. But it now fails with
```
File "/scratch/ngimel/work/pytorch/torch/_dynamo/symbolic_convert.py", line 379, in wrapper
self.output.compile_subgraph(self, reason=reason)
File "/scratch/ngimel/work/pytorch/torch/_dynamo/output_graph.py", line 562, in compile_subgraph
pass1.foreach(stack_values)
File "/scratch/ngimel/work/pytorch/torch/_dynamo/codegen.py", line 166, in foreach
self(i)
File "/scratch/ngimel/work/pytorch/torch/_dynamo/codegen.py", line 148, in __call__
output.extend(value.reconstruct(self))
File "/scratch/ngimel/work/pytorch/torch/_dynamo/variables/dicts.py", line 40, in reconstruct
codegen.create_load_python_module(collections),
TypeError: create_load_python_module() missing 1 required positional argument: 'push_null'
from user code:
File "/scratch/ngimel/work/env/lib/python3.9/site-packages/torchvision-0.15.0a0+928b05c-py3.9-linux-x86_64.egg/torchvision/models/detection/backbone_utils.py", line 58, in forward
x = self.fpn(x)
```
Looks like we never execute this `create_load_python_module()` path for other subgraphs.
Any advice on how to fix this, @voznesenskym @jansel?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95690
Approved by: https://github.com/jansel
**Summary**: This PR adds C++ stacktraces to jit::ErrorReports. After this PR, if you run with `TORCH_SHOW_CPP_STACKTRACES=1` environment variable and a jit::ErrorReport is thrown, then the C++ stacktrace should be displayed.
**More background**: This behavior already occurs for c10::Error; but not for jit::ErrorReport. jit::ErrorReport _does_ usually have a python stacktrace for the python source, but it is sometimes still helpful to know where in the C++ codebase the error came from.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94842
Approved by: https://github.com/qihqi
Part of my effort to move everything to pytest and decrease the number of testrunner frameworks in ci
Gives xmls, but they might look a bit weird because of module-level tests vs tests in classes.
Doesn't give the skip/disable test infra because those are tied to classes. (For future reference, we could either put tests in classes or move the check_if_enable stuff into a pytest hook.)
Tested in CI and checked that the same number of tests are run
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95659
Approved by: https://github.com/huydhn
Saw a bunch of timeout errors when trying to clone and build Triton today (c6d8d10b3e), so let's build triton as part of the Docker image.
* The pinned commit file is moved to the Docker context at `.ci/docker/ci_commit_pins/triton.txt`, and `.github/ci_commit_pins/triton.txt` is now a soft link pointing to it
* New Docker images are built whenever the pinned commit is updated
* The build logic is in `.ci/docker/common/install_triton.sh` which copies `install_triton` step in the CI. The latter can be removed in a separate PR after this one
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95233
Approved by: https://github.com/weiwangmeta, https://github.com/malfet
The value from the PR info includes only unique files, which is not the same as the number of files changed (both are technically correct, depending on how you view it).
I'm trying to merge this PR https://github.com/pytorch/pytorch/pull/95233 which makes `.github/ci_commit_pins/triton.txt` a softlink. So the PR includes 2 changes to that file: 1) to delete the file and 2) to add it as a symlink.
```
[
".ci/docker/build.sh",
".ci/docker/ci_commit_pins/triton.txt",
".ci/docker/common/common_utils.sh",
".ci/docker/common/install_triton.sh",
".ci/docker/requirements-ci.txt",
".ci/docker/ubuntu-cuda/Dockerfile",
".ci/docker/ubuntu/Dockerfile",
".github/ci_commit_pins/triton.txt", <--
".github/ci_commit_pins/triton.txt", <--
".github/workflows/build-triton-wheel.yml"
]
```
Trymerge doesn't like that and rejects the merge due to `Changed file count mismatch` https://github.com/pytorch/pytorch/actions/runs/4295438799/jobs/7485853815. This is because the PRInfo GraphQL result from GitHub only counts 9 of them https://paste.sh/zVsOnWoT#p_3RKX_VMjj-e71vwsTeA01W (search for `changedFiles`). It means that the names are deduplicated, so only unique file names are counted.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95720
Approved by: https://github.com/kit1980, https://github.com/malfet, https://github.com/ZainRizvi
The 2MB THP pages provide better allocation latencies compared to the standard 4KB pages. This change has shown significant improvement for batch-mode use cases where the tensor sizes are larger than 100MB.
Only enabled if `THP_MEM_ALLOC_ENABLE` environment variable is set.
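A hedged usage sketch (the allocation size is illustrative): since the opt-in is an environment variable, it is typically set in the launching environment before the process starts; setting it from Python before importing torch is shown here only for illustration.
```python
import os
os.environ["THP_MEM_ALLOC_ENABLE"] = "1"   # must be visible to the CPU allocator

import torch
# Large CPU tensors (well over 100MB) are where 2MB huge pages help most.
x = torch.empty(64 * 1024 * 1024, dtype=torch.float32)  # ~256MB allocation
```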
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93888
Approved by: https://github.com/jgong5, https://github.com/malfet
Summary:
Previously we assumed asymmetric quantization for dynamic quantization, this diff adds the support of symmetric quantization
for the input in dynamic quantization
Test Plan: buck run executorch/exir/tests:quant_lowering_custom_backend_pass -- "executorch.exir.tests.test_quant_lowering_custom_backend_pass.TestQuantLoweringCustomBackendPass.test_quantized_linear_dynamic"
Reviewed By: digantdesai
Differential Revision: D43134794
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94854
Approved by: https://github.com/digantdesai
# Summary
Previously, for NestedTensor inputs, flash_attention was disabled due to an Illegal Memory Access error that was occurring on the "cutlass" branch of flash-attention that had been incorporated into core. Since we have switched to the main branch of flash_attention, the existing repro script no longer produces the same memory error. This PR re-enables the FlashAttention path for NTs. It also unifies the nested preprocessing between the two implementations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95438
Approved by: https://github.com/mikaylagawarecki
### Motivation
Add `prelu` to lower precision cast policy on AutocastCPU to fix https://github.com/pytorch/pytorch/issues/95365 :
Before: Within the scope of torch.cpu.amp.autocast(dtype=torch.bfloat16), `prelu` cannot handle different datatypes for `input` and `weight` and will raise a RuntimeError. This scenario is common in autocast: e.g., with `autocast` to `bf16`, if the op before `prelu` produces a `bf16` output, which becomes the input of `prelu`, while `prelu`'s weight is `fp32`, then it will raise a RuntimeError.
After: Within the scope of torch.cpu.amp.autocast(dtype=torch.bfloat16), prelu is forced to run with the `bf16` data type.
Before https://github.com/pytorch/pytorch/pull/91238, when the input was `bf16`, the weight would be forced to cast to `bf16`. After https://github.com/pytorch/pytorch/pull/91238, this kind of test scenario raises a RuntimeError. There is no precision loss since the previously workable path also cast to `bf16`.
This also aligns with the Autocast CUDA whitelist.
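A hedged sketch of the scenario (layer sizes are illustrative): the `Linear` output under bf16 autocast feeds a `PReLU` whose weight is still fp32; with `prelu` on the lower-precision cast policy, the op runs in bf16 instead of erroring.
```python
import torch

m = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.PReLU())
x = torch.randn(2, 8)
with torch.cpu.amp.autocast(dtype=torch.bfloat16):
    out = m(x)   # Linear output is bf16; PReLU weight is fp32
print(out.dtype)  # torch.bfloat16
```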
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95366
Approved by: https://github.com/ngimel, https://github.com/lezcano, https://github.com/leslie-fang-intel
Fixes #79348
This change is mostly focused on enabling nvcc+sccache in the PyTorch CI.
Along the way we had to do couple tweaks:
1. Split the rules_cc from the rules_cuda that embedded them before. This is needed in order to apply a different patch to rules_cc compared to the one that rules_cuda applies by default. This in turn is needed because we need to work around an nvcc behavior where it doesn't send `-iquote xxx` to the host compiler, but it does send `-isystem xxx`. So we work around this problem by (ab)using `-isystem` instead. Without it we are getting errors like `xxx` is not found.
2. Work around a bug in bazel https://github.com/bazelbuild/bazel/issues/10167 that prevents us from using a straightforward and honest `nvcc` sccache wrapper. Instead we generate an ad-hoc, bazel-specific nvcc wrapper that has internal knowledge of the relative bazel paths to local_cuda. This allows us to work around the issue with CUDA symlinks. Without it we are getting `undeclared inclusion(s) in rule` all over the place for CUDA headers.
## Test plan
Green CI build https://github.com/pytorch/pytorch/actions/runs/4267147180/jobs/7428431740
Note that now it says "CUDA" in the sccache output
```
+ sccache --show-stats
Compile requests 9784
Compile requests executed 6726
Cache hits 6200
Cache hits (C/C++) 6131
Cache hits (CUDA) 69
Cache misses 519
Cache misses (C/C++) 201
Cache misses (CUDA) 318
Cache timeouts 0
Cache read errors 0
Forced recaches 0
Cache write errors 0
Compilation failures 0
Cache errors 7
Cache errors (C/C++) 7
Non-cacheable compilations 0
Non-cacheable calls 2893
Non-compilation calls 165
Unsupported compiler calls 0
Average cache write 0.116 s
Average cache read miss 23.722 s
Average cache read hit 0.057 s
Failed distributed compilations 0
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95528
Approved by: https://github.com/huydhn
Fixed the following errors in the contribution guide.
"deep neural networks using a **on** tape-based autograd systems." to "deep neural networks **using a tape-based** autograd systems."
"the best entrance **point** and are great places to start." to "the best entrance **points** and are great places to start."
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95454
Approved by: https://github.com/ezyang
This PR fixes 2 `DeprecationWarning` instances:
```
python3.8/site-packages/torch/utils/tensorboard/__init__.py:4
/home/stas/anaconda3/envs/py38-pt113/lib/python3.8/site-packages/torch/utils/tensorboard/__init__.py:4: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
if not hasattr(tensorboard, "__version__") or LooseVersion(
python3.8/site-packages/torch/utils/tensorboard/__init__.py:6
/home/stas/anaconda3/envs/py38-pt113/lib/python3.8/site-packages/torch/utils/tensorboard/__init__.py:6: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
) < LooseVersion("1.15"):
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95545
Approved by: https://github.com/ezyang
This is WIP PR for adding torch.export API in OSS. Couple of points:
- I intentionally named it experimental_export so that people don't get confused thinking this is our official API
- We don't plan to use AOTAutograd backend just yet. The reason we have it here is because the functionalization AOTAutograd uses is what we need for export (handling of param/buffer mutation etc). In the near future, I will extract the functionalization part and use it on top of make_fx. What we have right now is merely a placeholder.
- The reason we want to do it now is because we want to have some minimal tests running in OSS so that we can catch regressions earlier.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95070
Approved by: https://github.com/gmagogsfm, https://github.com/zhxchen17
This PR allows us to reuse the static per tensor decision making we make at fake tensorification time. We can use this to avoid setting up dynamic dim guards later if the tensor was never a candidate.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95566
Approved by: https://github.com/ezyang
**Summary**: jit.trace usually adds shape information to all the jit::Values in its graph. This is mostly a side effect of how jit tracing is performed, but many users use this behavior for debugging and for better understanding the graph. Previously, CallFunction nodes (inserted by calling jit.script-ed functions) did _not_ have this information attached. This PR attaches this information for the tensor output values.
**Details**:
* First the jit tracer sets a global TracerState object
* Then the jit tracer invokes the python callable that is to be traced
* When the python function gets to a jit.script-ed function, [invokeScriptFunctionFromPython](8693604bc6/torch/csrc/jit/python/pybind_utils.h (L1060)) is called. It inserts a FunctionCall.
* Then after the actual scripted function gets called and we have a concrete output, we attach the concrete output [IValue to the TracerState](8693604bc6/torch/csrc/jit/python/pybind_utils.h (L1001))
* ^^ the setValueTrace call (linked in previous list item) is where this PR makes changes; we revise the jit::Value output of the CallFunction node to use the type of the concrete tensor, which will have actual shapes associated.
**Test**: added a test verifying that shape info appears in the output type for a CallFunction node in a jit-traced graph.
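A small sketch of the behavior described above (shapes are illustrative): trace a function that calls a scripted function and inspect the traced graph, where the `prim::CallFunction` output now carries the concrete tensor type observed during tracing.
```python
import torch

@torch.jit.script
def scripted_add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    return x + y

def fn(x, y):
    return scripted_add(x, y) * 2

traced = torch.jit.trace(fn, (torch.randn(2, 3), torch.randn(2, 3)))
print(traced.graph)  # the CallFunction output type includes the traced shape
```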
Differential Revision: [D43592880](https://our.internmc.facebook.com/intern/diff/D43592880)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95544
Approved by: https://github.com/qihqi
Add an _int_mm primitive that binds the cuBLAS int8@int8 -> int32 matmul and translates to Triton-based mm templates under max autotune. This is a very useful first step towards better supporting quantization on the GPU. This is not a user-facing API, but an internal primitive.
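A hedged sketch of the primitive (shapes are illustrative and chosen to satisfy cuBLAS alignment constraints; it requires a sufficiently recent GPU, and being internal, the API may change):
```python
import torch

if torch.cuda.is_available():
    a = torch.randint(-128, 127, (32, 64), dtype=torch.int8, device="cuda")
    b = torch.randint(-128, 127, (64, 32), dtype=torch.int8, device="cuda")
    c = torch._int_mm(a, b)   # int8 @ int8 accumulated into int32
    print(c.dtype)            # torch.int32
```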
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94339
Approved by: https://github.com/ngimel, https://github.com/jansel
Fixes #ISSUE_NUMBER
When I override some operators for a new backend, this warning message prints for every op, which is too much. So just print it once for all operators.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95179
Approved by: https://github.com/bdhirsh
Summary: When running the benchmark test with --accuracy, two eager runs
should return the same result. If not, we want to detect it early, but
comparing against fp64_output may hide the non-determinism in eager.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95616
Approved by: https://github.com/ZainRizvi
Previous usage gave this error:
```
f.write(g.get_dot_graph().create_svg())
TypeError: write() argument must be str, not bytes
```
pydot has functions to save to different types, e.g. `save_svg()`. I updated the usage doc with working code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95534
Approved by: https://github.com/ezyang
This PR does two things:
1. It moves some Windows warning suppression from various CMake files into the main CMakeList.txt, following the conventions of gcc and clang.
2. It fixes some Windows warnings in the source code. Most importantly, it fixes lots of dll warnings by adjusting C10_API to TORCH_API or TORCH_PYTHON_API. There are still some dll warnings because some TORCH_API functions are actually built as part of libtorch_python
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94927
Approved by: https://github.com/malfet
Summary: Title; the mapping currently has lots of unused keys because the `or` condition always returns True, but this does not affect correctness.
Test Plan: N/A
Differential Revision: D43579510
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95493
Approved by: https://github.com/Skylion007
Summary: The existing util function did not quantize all inner
ops in the quantizable LSTM module, resulting in the error
"Could not run X with arguments from the 'QuantizedCPU' backend."
This commit fixes this by ensuring that all the other ops whose
qparams were not specifically configured are still quantized as
before, as in `torch.ao.nn.quantizable.LSTM.from_float`.
Test Plan: This commit also adds an additional check in the test
to ensure that the final converted model is in fact quantized,
in addition to just checking the qparams in the observers have
the right values.
python test/test_quantization.py TestQuantizeFx.test_static_lstm_with_custom_fixed_qparams
Reviewers: vkuzo
Subscribers: vkuzo, supriyar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95537
Approved by: https://github.com/vkuzo
Reuse the cpu implementation here as currently there is no native roll implementation from the MPS api (if any, please let me know).
Compared to falling back to cpu using `PYTORCH_ENABLE_MPS_FALLBACK=1`, this way we keep tensors on MPS.
Did a small benchmark:
```python
import time

import torch

for num in [10, 100, 1000, 10000]:
    for shft in [1, 5]:
        sz = num * num
        x = torch.arange(sz, device="cpu").view(num, num)
        s = time.time()
        r = torch.roll(x, shft)
        cpu_e = time.time() - s
        x = torch.arange(sz, device="mps").view(num, num)
        s = time.time()
        r = torch.roll(x, shft)
        mps_e = time.time() - s
        print(f"size: ({num}, {num}) shft: {shft} cpu: {cpu_e} mps: {mps_e}")
```
```
size: (10, 10) shft: 1 cpu: 0.00015163421630859375 mps: 0.003078937530517578
size: (10, 10) shft: 5 cpu: 6.794929504394531e-05 mps: 0.0014979839324951172
size: (100, 100) shft: 1 cpu: 0.0001621246337890625 mps: 0.0016200542449951172
size: (100, 100) shft: 5 cpu: 0.00016379356384277344 mps: 0.00154876708984375
size: (1000, 1000) shft: 1 cpu: 0.0022068023681640625 mps: 0.0017690658569335938
size: (1000, 1000) shft: 5 cpu: 0.009071111679077148 mps: 0.0020020008087158203
size: (10000, 10000) shft: 1 cpu: 0.16785407066345215 mps: 0.011695146560668945
size: (10000, 10000) shft: 5 cpu: 0.1160881519317627 mps: 0.011452913284301758
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95168
Approved by: https://github.com/albanD
Currently, if we multiply a transposed batch of matrices with shape
[b, m, n] and a matrix with shape [n, k], when computing the gradient
of the matrix, we instantiate a matrix of shape [b, n, k]. This may be
a very large matrix. Instead, we fold the batch of matrices into a
matrix, which avoids creating any large intermediary tensor.
Note that multiplying a batch of matrices and a matrix naturally occurs
within an attention module, so this case surely happens in the wild.
In particular, this issue was found while investigating the OOMs caused by the
improved folding algorithm in the next PR of this stack. See https://github.com/pytorch/pytorch/pull/76828#issuecomment-1432359980
This PR fixes those OOMs and decreases the memory footprint of the
backward of matmul.
I understand this is a tricky one, so I put it on its own PR to discuss it.
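A hedged illustration of the shapes involved (the sizes below are hypothetical):
```python
import torch

b, m, n, k = 64, 128, 256, 32
batch = torch.randn(b, n, m)                      # transposed below to [b, m, n]
mat = torch.randn(n, k, requires_grad=True)

out = torch.matmul(batch.transpose(-2, -1), mat)  # [b, m, k]
out.sum().backward()
# Computing mat.grad naively materializes a [b, n, k] intermediate; folding the
# batch into a single [b*m, n] matrix avoids that large temporary.
print(mat.grad.shape)                             # torch.Size([256, 32])
```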
Differential Revision: [D43541495](https://our.internmc.facebook.com/intern/diff/D43541495)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95261
Approved by: https://github.com/ezyang
Add `mps_ops_modifier` function that adds `unittest.expectedFailure` decorators to the operators that are supposed to fail on MPS.
This allows one to know whether or not operation will fail, rather than skip it.
For example:
```
% python test_mps.py -v -k test_output_match_dot
test_output_match_dot_cpu_float32 (__main__.TestConsistencyCPU) ... ok
test_output_match_dot_cpu_int16 (__main__.TestConsistencyCPU) ... ok
test_output_match_dot_cpu_int32 (__main__.TestConsistencyCPU) ... ok
test_output_match_dot_cpu_int64 (__main__.TestConsistencyCPU) ... expected failure
test_output_match_dot_cpu_uint8 (__main__.TestConsistencyCPU) ... ok
----------------------------------------------------------------------
Ran 5 tests in 0.175s
OK (expected failures=1)
```
Moved a few functions from blocklist to xfail, and found out that some of the functions in the list actually work, for example `torch.long`.
Also, allow `None` to be used in `ALLOWLIST` instead of specifying all types explicitly (which aligns with `DecorateInfo` semantics).
Eventually, we should get rid of `ALLOWLIST` (i.e. all ops are allowed), keep a small `BLOCKLIST`, and move the rest to `XFAILLIST`.
Add step to print HW/SW info before running MPS tests.
Fix type promotion in `trace_mps_out`
Introduce `MACOS_12_X_XFAILLIST` and skip almost every function for `torch.uint8`, although some of those don't make much sense and feel like a regression from PyTorch-1.13.
Re-enabled MPS testing on MacOS 12, as runners seem to be available again.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95045
Approved by: https://github.com/albanD
The idea is to make it a little more obvious which branch you're going to go down in a subset of cases, and make it easier to detect if you've accidentally shadowed one condition with another (the reason I wrote this in the first place.) The type dictionary also makes it harder for people to accidentally use isinstance when they should have used istype.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95476
Approved by: https://github.com/jansel
Fixes issues with things like:
```python
x = 2
x += y.shape[0]
```
resulting in invalid `2 += y.shape[0]` code in the FX graph.
Fix: Whenever dynamic shapes are involved, insert the out-of-place op to the FX graph instead of the in-place op.
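A minimal repro sketch under dynamic shapes (assuming `dynamic=True` is enough to make `y.shape[0]` symbolic here):
```python
import torch

@torch.compile(dynamic=True)
def f(y):
    x = 2
    x += y.shape[0]   # must become an out-of-place add in the FX graph
    return y * x

print(f(torch.randn(5)))
```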
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95446
Approved by: https://github.com/ezyang
The _make_boxed logic probably needs a cleanup, but this fixes a spurious warning that we should get in before the release.
Confirmed that this used to emit a warning and no longer does:
```
import torch
lin = torch.nn.Linear(100, 10)
def f(x):
    return lin(x)
opt_f = torch.compile(f)
opt_f(torch.randn(10, 100, requires_grad=False))
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95521
Approved by: https://github.com/ngimel
- The previous PR addressed one tree traversal in `_root_pre_forward()` but not the main one from `_get_fsdp_handles()` that runs for all settings.
- This PR saves `_all_handles` to cache `_get_fsdp_handles()` and `_all_fsdp_states` to cache `_get_fsdp_states()` (renamed from `_fsdp_states` compared to last PR) on the root state.
- This PR introduces a dummy `_RootFSDPState` class that inherits from `_FSDPState` to be used only for type checking since some attributes are only defined for root states.
- I found this approach to be better than adding `_p_assert(state.root_only_attr is not None, ...)` upon each usage of `root_only_attr`.
- This hopefully also helps readers to quickly see which attributes are defined only on root states.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95465
Approved by: https://github.com/fduwjj
Summary: A bisect blamed #93333 for GPU memory leakage. This diff backs it out.
Test Plan: Monitor max GPU memory usage to see if there's a leak.
Reviewed By: hyuen, yinbinm
Differential Revision: D43511893
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95565
Approved by: https://github.com/ngimel
In PR #93822 the `fx2trt` backend, which registered the `tensorrt` backend name to point to `fx2trt` / `torch_tensorrt`, was removed and the name was moved to `onnxrt`. We want to reserve the name `tensorrt` for `torch_tensorrt` to prevent any confusion, but due to code freeze we cannot complete the integration and set up testing for the next release. So we propose leaving out the `tensorrt` name until we can set up the backend and testing for it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94632
Approved by: https://github.com/frank-wei
It does not take a condition as its first argument, unlike `TORCH_CHECK`.
Test plan: run ` python3 -c "import torch;print(torch.arange(1., 10.,device='mps').view(3, 3).trace())"` and observe no warning.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95559
Approved by: https://github.com/Skylion007
My intention is to collapse all of the istype() and isinstance() and object identity tests into a more structured form involving a dict lookup. To do this conveniently, I need every continuation to be expressible in a single expression. Thus, all multi-line wrap methods are moved. This is code motion only, no logic changes.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95472
Approved by: https://github.com/Skylion007
Some of these changes are semantics-preserving, some are not. Please review carefully. A small illustration of the istype/isinstance distinction follows the list.
* Use `istype(x, y)` over `type(x) is y`
* Use istype over isinstance in frozenset. If the user subclassed the type in question, we must treat it as a user-defined class, as it may have custom behavior
* The `isinstance(value, (int, float))` condition for `wrap_unspecialized_primitive` is dead-ish; direct int/float values are caught by an earlier istype check. Technically, if you subclassed int/float it would pass through, but this is almost assuredly not intended behavior
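A small, self-contained illustration of the istype/isinstance distinction referenced above (not Dynamo code):
```python
class MyInt(int):
    pass

x = MyInt(3)
print(isinstance(x, int))  # True: subclasses match
print(type(x) is int)      # False: this is what istype(x, int) checks
```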
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95471
Approved by: https://github.com/Skylion007
Summary:
When performing inference using the Core ML delegate, memory is increasing indefinitely. This is due to Core ML allocating memory within `predictionFromFeatures:error:`. Seems that the autorelease pool does not release the return values from the prediction method until inference is stopped completely. So we need to release with `autoreleasepool` manually ([per Apple guidance in the Apple Developer Forums](https://developer.apple.com/forums/thread/692425)).
This commit wraps `autoreleasepool` around the `execute` function of `PTMCoreMLBackend`, which is the scope of where the return values of `predictionFromFeatures:error:` are. Also added in `PTMCoreMLExecutor` for good measure.
Differential Revision: D43520767
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95384
Approved by: https://github.com/mcr229
Summary:
Regression introduced in #91134 (github-exports-check calls git, which is not available internally at Meta).
Meta employees, see T145865943 for the context.
Test Plan: Unit tests, `github-export-checks` job.
Differential Revision: D43521051
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95345
Approved by: https://github.com/kit1980
We have an outage with the MacOS m1 runners, so we need to disable the job till next Monday, when infra has capacity to look into the issue.
Note: Do we want to keep MPS tests on `macos-m1-13`? (As long as these new runners are still there)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95509
Approved by: https://github.com/clee2000
Summary:
This is part 1 of the effort to support `share_memory_()` in C++ aten library.
This allows C++ code to replace the tensor storage in place with a shm-based one.
For now, fd-based shm is the only supported implementation, to simplify memory management in general.
This first part intentionally avoids public api changes (to `TensorBase`, see comments in `StorageUtil.h`) such that we can get the core features usable outside pt/csrc first. The API addition to `Tensor` or `TensorBase` would involve more distracting changes and make the change harder to review.
Test Plan:
```
buck test caffe2:StorageUtils_test
```
Differential Revision: D43467616
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95228
Approved by: https://github.com/ezyang
Tweak dynamo behavior in 2 places when calling nn.Modules,
to route the call to __call__ instead of .forward(), since
__call__ is the codepath that eager users hit and will dispatch
to hooks correctly.
(1) inside NNModuleVariable.call_function, which covers the common case
of calling a module from code dynamo is already tracing
(2) at the OptimizedModule layer, which is the entrypoint
into a top-level nn.Module dynamo is about to compile
This exposes a new bug: NNModuleVariable used to special-case calling
module.forward() (which is a method) as a UserFunctionVariable with an extra
'self' arg. After tracing into module.__call__, there is no longer a special
case for the eventual call into .forward, and it gets wrapped in a
UserDefinedObjectVariable following standard behavior of ._wrap(). UDOV can't be
called, so this broke some tests.
- Fix: add a new special case in _wrap() that treats methods as a UserDefinedMethod
instead of UserDefinedObjectVariable. Now, the forward method can be called.
Also, fix NNModuleVar.call_method to route forward back to __call__
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92125
Approved by: https://github.com/ezyang, https://github.com/jansel, https://github.com/voznesenskym
BC: This changes the signature and semantics of DeviceMesh::all_reduce.
DeviceMesh::all_reduce now uses a functional collective under the hood which makes it more easily traceable.
You no longer need to use CommTensor to get a trace.
all_reduce now is async only and uses AsyncCollectiveTensor to ensure proper stream synchronization.
Signature changed: removed `async_op` param and changes return type from `Optional[Work]` to `torch.Tensor`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95009
Approved by: https://github.com/wanchaol
Fixes https://github.com/pytorch/serve/issues/1937
A fairly common query I see folks running while using pytorch is
`nvidia-smi --format=csv,noheader,nounits --query-gpu=utilization.gpu,utilization.memory,memory.total,memory.used,temperature.gpu,power.draw,clocks.current.sm,clocks.current.memory -l 10`
Existing metrics we have
* For kernel utilization`torch.cuda.utilization()`
* For memory utilization we have them under `torch.cuda.memory` the memory allocated with `torch.cuda.memory.memory_allocated()`
* For total available memory we have `torch.cuda.get_device_properties(0).total_memory`
Which means the only metrics we're missing are
* Temperature: now in `torch.cuda.temperature()`
* Power draw: now in `torch.cuda.power()`
* Clock speed: now in `torch.cuda.clock_speed()`
With some important details on each (a usage sketch follows this list):
* Clock speed settings: I picked the SM clock domain which is documented here https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceEnumvs.html#group__nvmlDeviceEnumvs_1g805c0647be9996589fc5e3f6ff680c64
* Temperature: I use `pynvml.nvmlDeviceGetTemperature(handle, 0)` where 0 refers to the GPU die temperature
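A hedged usage sketch of the metrics above (requires a CUDA device with pynvml/NVML available; only accessors whose final public names are certain are called here):
```python
import torch

if torch.cuda.is_available():
    dev = 0
    print("kernel util:", torch.cuda.utilization(dev))
    print("mem used   :", torch.cuda.memory_allocated(dev))
    print("mem total  :", torch.cuda.get_device_properties(dev).total_memory)
    # New in this PR; power draw and SM clock speed are exposed through
    # analogous torch.cuda accessors described above.
    print("temperature:", torch.cuda.temperature(dev))
```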
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91717
Approved by: https://github.com/ngimel
This removes the need to explicitly constrain_unify `x[mask]` and `y[mask]` when mask is a boolean tensor. It's very narrow but it seems to work in practice.
To invalidate the nonzero call when mutation occurs, I use version counter. I know there are ways to bypass this but I think it's good enough for now.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95399
Approved by: https://github.com/eellison
This takes the strategy described in https://docs.google.com/document/d/1lFRYAJo5nrfxRhwIzGnfi2pbLpU6T4ytSRSuLJ5qebI/edit#
It is essentially https://github.com/pytorch/pytorch/pull/95222 but squashed and with changes that are unnecessary given that we assume nonzero returns > 1.
What's in the PR (a minimal usage sketch follows the list):
* nonzero now supports meta propagation. When `capture_dynamic_output_shape_ops`, it will return a tensor with an unbacked SymInt representing the size in question.
* The unbacked SymInt is UNSOUNDLY assumed to be not equal to 0/1. We will still error if you guard otherwise.
* PrimTorch pointwise operators are updated to use empty_permuted, to avoid guarding on unbacked SymInt from empty_strided (tested in `test_dynamic_pointwise_scalar`)
* Convolution is updated to skip backend selection if batch is unbacked, to avoid guarding on unbacked SymInt (tested in `test_unbacked_batch_resnet`)
* I kept the helper utilities like `definitely_true` for working with possibly unbacked SymInts. They're not used right now but maybe someone will find them useful.
* Added `constrain_unify` to let you specify two unbacked SymInts must have the same value
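A minimal sketch of capturing a data-dependent output shape (the toy function is hypothetical; `capture_dynamic_output_shape_ops` is the config referenced above):
```python
import torch
import torch._dynamo

torch._dynamo.config.capture_dynamic_output_shape_ops = True

@torch.compile(fullgraph=True)
def f(x):
    idx = torch.nonzero(x)   # the length of idx is an unbacked SymInt
    return idx * 2

print(f(torch.tensor([0.0, 1.0, 0.0, 2.0])))
```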
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95387
Approved by: https://github.com/voznesenskym
Corrected the grammar of a sentence in "Implementing Features or Fixing Bugs" section of the contribution guide.
**Before:**
Issues that are labeled first-new-issue, low, or medium priority provide the best entrance point are great places to start.
**After:**
Issues that are labeled first-new-issue, low, or medium priority provide the best entrance point _and_ are great places to start.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93014
Approved by: https://github.com/albanD, https://github.com/kit1980
There is a fast way to implement a guard for an empty dict, which is to check its bool() value.
However, we can't use this guard in general, since we can only safely apply it at runtime if the runtime value actually is a dict (or another type that works with 'bool' in the same way). A counterexample is when a tensor is passed instead of a dict, which throws on the bool() operator.
So we can put a type check in the guard, but that is slow enough it defeats the purpose.
Instead, we note that for the case of NNModuleVariables (which are specialized NNModules not unspecialized ones), we already have a hook in place to invalidate the guards if setattr is called. I am claiming that setattr is the only way that the type of a property on an NNModule could change. If I'm right, then it's safe to (a) only use this guard for NNModuleVariables, (b) not do a type check inside the guard.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95248
Approved by: https://github.com/voznesenskym
Currently, in Vulkan 4D tensors are represented in GPU textures by simply combining the batch and channel dimensions into the depth axis. However, if the number of channels is not a multiple of 4, then data belonging to the same batch can cross texel boundaries.
For instance, consider a tensor with `N=2`, `C=3`. The depth axis of the texture would contain the data
```
|tex1|tex2|
-----------
|AAAB|BB00|
```
Where A represents data from `n=1` and B represents data from `n=2`.
This packing structure ("tight packing") makes some ops that care about batch boundaries more complex and inefficient to implement. Therefore this diff introduces channel padding when storing tensors as image textures.
The same tensor with `N=2`, `C=3` would now have the depth axis contain
```
|tex1|tex2|
-----------
|AAA0|BBB0|
```
Differential Revision: [D43068669](https://our.internmc.facebook.com/intern/diff/D43068669/)
**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D43068669/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95251
Approved by: https://github.com/salilsdesai
As in the title. The bug was reported in https://github.com/pytorch/pytorch/pull/94728#discussion_r1108892366 and has the following reproducer:
```python
>>> import torch
>>> check_ctx = torch.sparse.check_sparse_tensor_invariants(True)
>>> no_check_ctx = torch.sparse.check_sparse_tensor_invariants(False)
>>> with check_ctx:
... assert torch.sparse.check_sparse_tensor_invariants.is_enabled()
... with no_check_ctx:
... assert not torch.sparse.check_sparse_tensor_invariants.is_enabled()
... assert torch.sparse.check_sparse_tensor_invariants.is_enabled()
...
Traceback (most recent call last):
File "<stdin>", line 5, in <module>
AssertionError
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95372
Approved by: https://github.com/cpuhrsch
Fix for weird bug that happens very rarely. My solution is to retrieve all checksuites before going to retrieve their checkruns.
Sometimes `cs_cursor=edges[edge_idx - 1]["cursor"] if edge_idx > 0 else None,` is None when it shouldn't be because of how we reset `checksuites = get_next_checksuites(checksuites)` on every loop.
Example:
page 1 of checksuites contains some stuff
page 2 of checksuites: pull {a bunch of checkruns}
cs_cursor gets set to none for the pull checksuite on page 2 because `checksuites = get_next_checksuites(checksuites)` resets the edges on every loop. Then the checkruns can't be retrieved.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95333
Approved by: https://github.com/huydhn
Fixes #91694, fixes #92615
Several transpositions were missing in the backward graph in the case of `batch_first=True`. #91694 is not reproduced with `batch_first=False`.
After fixing the transpose issue, I finally thought that now I can use LSTM freely in my project. And then I got horrific results on training. Seems related to #92615.
After that I decided to fix LSTM's backward step completely. I collected all my findings in this thread — seems like I succeeded
Funny enough, backward tests were completely disabled before and were not passing:
```python
@unittest.skipIf(True, "Backward of lstm returns wrong result")
def test_lstm_2(self, device="mps", dtype=torch.float32):
```
UPD: the forward pass of the multi-layer version was also wrong due to the incorrect `initState, initCell` slices. Tests were passing because states were inited with zeros. *Accidentally* fixed this too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95137
Approved by: https://github.com/jhavukainen, https://github.com/kulinseth, https://github.com/soulitzer
According to ngimel (and also noticed by me), printing x1*s0**2 doesn't work correctly in Sympy, as it complains
'<' not supported between instances of 'tuple' and 'str'.
This is probably a Sympy bug, but the real answer is that subclassing is more trouble than it's worth and we ought not do it.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95313
Approved by: https://github.com/ngimel
It's common to call ```dict()``` or ```collections.OrderedDict()``` inside of a ```forward``` function, so we should not graph break.
This pattern has been used in many places (a minimal sketch follows this list), including:
* The use case in [torchvision](928b05cad3/torchvision/models/_utils.py (L66-L73)).
* It causes ~100 model failures (nopython=True) in the 14k github models.
* Also it hits several Meta internal use cases.
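A minimal sketch of the pattern (the module below is hypothetical):
```python
import collections

import torch

class M(torch.nn.Module):
    def forward(self, x):
        out = collections.OrderedDict()   # previously a graph break
        out["feat"] = torch.relu(x)
        return out

compiled = torch.compile(M(), fullgraph=True)
print(compiled(torch.randn(2, 3))["feat"].shape)
```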
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95250
Approved by: https://github.com/jansel
Temporary Fix for #95312
In triton, 1 warp computes a 16x16 tile of output, so for a 32x32 block we only need 4 warps. 8 warps cause an IMA (illegal memory access), which is a bug, but it's not a good config anyway.
Triton main is supposed to have better behavior for these pathological cases, but we are not on main yet.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95339
Approved by: https://github.com/ezyang, https://github.com/Chillee
Summary: This change adds the input shape when CoreML throws an error.
Test Plan: testMCSModelInvalidInputShape tests that the assert throws when invalid input shapes are provided.
Differential Revision: D43449112
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95249
Approved by: https://github.com/mcr229
Summary:
bypass-github-export-checks
use `dinfo.name` instead of `repr(dinfo)`, as initial results have shown that `dinfo.total_memory` may unexpectedly fluctuate
Test Plan: sandcastle + CI
Differential Revision: D43503558
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95302
Approved by: https://github.com/bertmaher
This handles disabling masks when numel is a multiple of BLOCK.
It currently introduces a performance regression, but the triton it generates does not seem to have any issues: all the change does is cause xmask to be removed from loads/stores in cases where it can safely be removed. The regression seems to be coming from some issue in the triton optimizer.
FWIW, if you try this change with current triton master (instead of
pinned version) it does _not_ cause a performance regression.
However, upgrading to triton master by itself already causes
significant performance regressions so it's not an option
to just bump up the pin.
I'm going to leave this PR open until we manage to increase
the triton pin past the big refactoring. Once we do that
I will check if it still causes a performance regression.
UPDATE:
The triton pin has been moved and I retried this PR. As expected, there's no longer a performance regression for hf_Bert:
```
tspin python benchmarks/dynamo/torchbench.py --performance --backend inductor --float16 --training --batch-size-file $(realpath benchmarks/dynamo/torchbench_models_list.txt) --only hf_Bert -n 5 --diff-branch viable/strict 2> err
batch size: 16
cuda train hf_Bert numel_BLOCK 1.175x p=0.00
batch size: 16
cuda train hf_Bert viable/strict 1.161x p=0.00
```
Re-opening this, should be okay to merge now I expect.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92749
Approved by: https://github.com/jansel
I am still reading Dynamo source code...
This is an easy PR to simplify `Source.is_nn_module()` to reuse `GuardSource.is_nn_module()` instead of having the `in (...)` check implemented twice. While simplifying that, I thought I might as well add some type annotations for `Source` methods.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95292
Approved by: https://github.com/ezyang
Summary: attempt two at enabling search of global/local cache, regardless of `max_autotune`, by default. the main problem is that triton template generation seems to be broken in some cases for CI tests (maybe dynamic shapes), but this is going to take more time to figure out. for now, we can just cancel template generation instead of raising an assertion error and filter out those failed templates.
Test Plan: sandcastle + CI
Differential Revision: D43424922
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95134
Approved by: https://github.com/jansel
Running an operator registered in python returning a symint will result in the following error:
```
RuntimeError: Unable to cast Python instance of type <class 'torch.SymInt'> to C++ type 'long'
```
The interaction of two things triggers the issue:
- We use a boxed kernel here. For boxed kernels, we need to convert py::object to IValue in torch/csrc/autograd/python_variable.cpp pushPyOutToStack.
- In the schema parsing code in torch/csrc/jit/frontend/schema_type_parser.cpp SchemaTypeParser::parseFakeAndRealType, if a SymInt is found, we register an Int type instead (not sure why we do this), and register SymInt as the real type.
The result is that we would convert a SymInt to an int in pushPyOutToStack, causing the issue.
The fix is to use real type when we convert py::object to IValue.
BTW, registering the same op using C++ API does not trigger the issue.
```
TORCH_LIBRARY(clib, m) {
  m.def("sqsum(SymInt a, SymInt b) -> SymInt", [](SymInt a, SymInt b) -> SymInt {
    return a * a + b * b;
  });
}
```
The reason is, the kernel registered in C++ is unboxed kernel and it does not trigger the code path above that converts an py::object to IValue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95240
Approved by: https://github.com/larryliu0820, https://github.com/ezyang
Rolling back the default change for Adam and rectifying the docs to reflect that AdamW never defaulted to fused.
Since our fused implementations are relatively newer, let's give them a longer bake-in time before flipping the switch for every user.
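A hedged sketch of opting in explicitly now that fused is no longer the default (CUDA only):
```python
import torch

if torch.cuda.is_available():
    model = torch.nn.Linear(8, 8).cuda()
    # fused=True must be requested explicitly; it is not the default.
    opt = torch.optim.Adam(model.parameters(), lr=1e-3, fused=True)

    loss = model(torch.randn(4, 8, device="cuda")).sum()
    loss.backward()
    opt.step()
```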
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95241
Approved by: https://github.com/ngimel
An action item from https://github.com/pytorch/pytorch/issues/94346
Although the security practice of setting the checksum is good, it doesn't work when the archive is downloaded from some sites like GitHub because it can change. Specifically, GitHub gives no guarantee to keep the same value forever https://github.com/community/community/discussions/46034.
This also adds a new linter to make sure that SHA checksum from GitHub can be removed quickly. The WORKSPACE file is actually updated using the new linter:
```
>>> Lint for WORKSPACE:
Advice (BAZEL_LINTER) format
Redundant SHA checksum. Run `lintrunner -a` to apply this patch.
You can run `lintrunner -a` to apply this patch.
5 5 |
6 6 | http_archive(
7 7 | name = "rules_cuda",
7 |- sha256 = "f80438bee9906e9ecb1a8a4ae2365374ac1e8a283897281a2db2fb7fcf746333",
9 8 | strip_prefix = "runtime-b1c7cce21ba4661c17ac72421c6a0e2015e7bef3/third_party/rules_cuda",
10 9 | urls = ["b1c7cce21b.tar.gz"],
11 10 | )
--------------------------------------------------------------------------------
29 28 | name = "pybind11_bazel",
30 29 | strip_prefix = "pybind11_bazel-992381ced716ae12122360b0fbadbc3dda436dbf",
31 30 | urls = ["992381ced7.zip"],
31 |- sha256 = "3dc6435bd41c058453efe102995ef084d0a86b0176fd6a67a6b7100a2e9a940e",
33 31 | )
34 32 |
35 33 | new_local_repository(
--------------------------------------------------------------------------------
52 50 | urls = [
53 51 | "https://github.com/gflags/gflags/archive/v2.2.2.tar.gz",
54 52 | ],
54 |- sha256 = "34af2f15cf7367513b352bdcd2493ab14ce43692d2dcd9dfc499492966c64dcf",
56 53 | )
57 54 |
58 55 | new_local_repository(
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95039
Approved by: https://github.com/ZainRizvi
1. Moving `test_jit_cuda_fuser.py` `test_nvfuser_dynamo.py` `test_nvfuser_frontend.py` under `third_party/nvfuser/python_tests/`.
2. Moving `nvfuser/__init__.py` to `third_party/nvfuser/python/`.
3. Leaving dummy test scripts under `./test/` for CI.
4. Patching `torch/_prims/nvfuser_prims.py` for view/reshape renaming in nvfuser
5. Installing `third_party/nvfuser/python` and `third_party/nvfuser/python_tests` to the pytorch root/test directory.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95155
Approved by: https://github.com/davidberard98
Fix bug where a github api failure would prevent the check from failing even if we already saw that labels were needed.
Also adds more debugging info to the rate limit exceeded error since it's weird to see an error claiming the rate limit has exceeded when the "Used" amount is way below the limit. I suspect these happen when the request arrived just before the rate reset time, but the response was generated right after the reset time, hence the apparently tiny "used" amounts
Example run where the check should have failed, but passed instead:
https://github.com/pytorch/pytorch/actions/runs/4200205209/jobs/7285979824
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95098
Approved by: https://github.com/huydhn
Pass in repo args now that they're required (after a recent refactor). Also changes the script to pass in the repo name instead of being hardcoded to pytorch/pytorch.
I'm guessing this wasn't noticed earlier since the workflow is only triggered when a label is created/edited/deleted
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95227
Approved by: https://github.com/huydhn
Currently, the logger timer is registered by default for cpu/cuda. Other backends may or may not register this timer, so the code reports a warning and returns for them, which is not expected.
This is wrong if the backend has registered the timer. For example, the HPU (habana) backend registers this timer, so in that case reporting a warning and returning is incorrect.
The other case is where a lazy backend timer is never registered, so this returns a warning; that is the reason the check was added, but it fails for the other cases.
Add a generic check: if the timer is registered, don't report a warning.
Signed-off-by: Jeeja <jeejakp@habana.ai>
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91702
Approved by: https://github.com/kit1980
Finally, swin is passing, with no floors in the generated code.
I don't know how to write a test for it though, and swin patterns triggering this are pretty complicated (even prior to this PR we were already good at pulling `floors` out of device code).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95152
Approved by: https://github.com/ezyang
This prevents us from guarding on leading unbacked SymInts.
The previous attempt at https://github.com/pytorch/pytorch/pull/94521 I got the logic a bit wrong. My idea there was to avoid slicing when the values to be set have low enough dimensionality that they definitely aren't too long. To do this, I need to compute the difference between the data to be set, and the post-slice space for the values. But I incorrectly compared against the *pre-slice* space in the original PR. Another version of this PR which is wrong is to compare against variableIndices.size(); but remember that in advanced indexing with tensors/lists, each of the individual indices specify what coordinates to read out of each dimension! A third incorrect attempt tested `variableIndices[0].dim()`, which is only correct if you don't broadcast one of the later variable indices, and if there are enough variableIndices to cover all dims. This is all quite complicated, so I went for a simpler solution of checking if the leading dim had a hint before testing if it is not equal to one.
BTW, there was previously no test for this stripping behavior. There is now a test for it, based off the real code that caused the problem.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95141
Approved by: https://github.com/ngimel
Summary:
fix `TypeError: 'Float' object cannot be interpreted as an integer` for `ValueRanges.pow(a, b)` when `not a.is_singleton() and b.is_singleton() and not isinstance(b.lower, int)`
this is breaking `cuda11.7-py3.10-gcc7-sm86 / test (inductor_timm, 1, 2, linux.g5.4xlarge.nvidia.gpu)`
Test Plan: sandcastle + CI
Differential Revision: D43430385
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95151
Approved by: https://github.com/Skylion007
After some thought, I find it difficult to come up with a robust naming convention that satisfies the following constraints at the same time: 1. the new name should be a valid nn.Module attribute (as required by the minifier, and a good thing to have in general) 2. it can cover various cases such as GetItemSource, GetAttrSource 3. it's easy to recover the original path 4. it is robust to users' naming schemes.
Thanks to @yanboliang for pointing out the original access path is preserved in Source; now we just need to add an additional value source.name() to node.meta["nn_module_stack"] to get the access path in the original module.
We also address some TODOs in quantization, which rely on the original naming convention in nn_module_stack.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94945
Approved by: https://github.com/jansel, https://github.com/yanboliang
torch.empty_permuted is a generalized version of torch.empty(memory_format=...), where you can pass an arbitrary physical layout as a tuple of dims to allow you to setup dense, non-overlapping tensors with non-standard memory format. Check the docblock for a full description of semantics.
The initial motivation for this PR is with guard-less unbacked SymInts. Traditionally, the way we allocate dense tensors with arbitrary layout is with `empty_strided`. However, `empty_strided` does not know that the given strides are actually contiguous, and must test this manually to find out if it is the case. With `empty_permuted`, this is known statically to be the case and helps us skip some 0/1 guards.
However, I also think torch.empty_permuted is a useful API in its own right. It is technically possible to simulate this with an empty and a permute; however, there are some downsides:
* The manual incant is tricky to work out. To allocate an NHWC tensor, the invocation is `torch.empty(N, H, W, C).permute(0, 3, 1, 2)`; the permute call has to take NHWC to NCHW, and is the *inverse* of the permutation people are typically thinking of when they talk about NHWC (0, 2, 3, 1). Instead, torch.empty_permuted lets you say `torch.empty_permuted((N, C, H, W), (0, 2, 3, 1))`, letting you provide the intuitive permutation. It can literally be read off as NHWC if you assign N=0, C=1, H=2, W=3.
* An empty(requires_grad=True).permute() is no longer a leaf tensor. You can force it to be a leaf with a detach(), but it is more straightforward and less error prone to allow directly allocating a tensor with the correct permutation.
It is also technically possible to simulate this with empty_strided. However, this requires the user to manually compute the contiguous output strides and is bad from a reduction of guards perspective. For what it's worth, this is one of the more common uses of as_strided in the wild, and it would be nice to get rid of it.
A nice enhancement of this feature would be to accept `physical_layout` anywhere `memory_format` is accepted. However, this would be a pretty involved change, so I'm doing the easy thing instead.
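A short sketch of the API as described above:
```python
import torch

# Logical sizes (N, C, H, W) laid out physically as NHWC.
t = torch.empty_permuted((2, 3, 4, 5), (0, 2, 3, 1))
print(t.shape)                                              # torch.Size([2, 3, 4, 5])
print(t.is_contiguous(memory_format=torch.channels_last))   # True
```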
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95069
Approved by: https://github.com/malfet, https://github.com/ngimel, https://github.com/albanD, https://github.com/dagitses
This PR introduces a new `constrain_range` function which can be used to constrain the possible values a SymInt/SymFloat can take on. This knowledge can be then used to discharge potential guards (by running the range analysis, and then seeing if the guard must be true given the original range) without adding another guard.
The usage of ranges is very limited right now; ranges are only constrained when the user explicitly instructs the system so. However, we can also infer range constraints based on guards as well; this is left for future work.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95063
Approved by: https://github.com/eellison
Summary:
this diff adds logic to handle a global autotuning cache, stored in json format at config.global_cache_path.
what is changing from `DiskCache`:
* `DiskCache` is renamed to `PersistentCache`
* the local cache is now stored as a single file in json format, located at `/tmp/torchinductor_{$USER}/local_cache`. the file contains a dictionary structure like `local_cache[name][inputs][choice]` where `name` is the type of operation, like `addmm`, `inputs` is the repr of the inputs, and `choice` is the hash of a `ChoiceCaller`. the stored value is the benchmark time for that `ChoiceCaller`.
* a global cache is added, initially stored at `fbcode/caffe2/torch/_inductor/global_cache`, with almost identical format as the local cache. since the global cache exists over different machines, there is an additional `dinfo` field, such that `global_cache[dinfo] = local_cache` (at least structure wise, there is no guarantee that the global cache and local cache share the same values). `dinfo` is just a repr of the cuda device properties.
* the autotuner will prioritize the global cache, and return values from there first, before looking in the local cache
* the autotuner will look in both the global cache and the local cache even when `max_autotune=False`, but will still only generate values if `max_autotune=True`.
* the autotuner will log global cache hits and misses to a scuba table (inductor_autotuning_cache) which will be used to update the global cache at regular intervals
Test Plan: D43285472
Differential Revision: D42785435
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94922
Approved by: https://github.com/jansel
Fixes #94390
Apart from fixing the issue above, this PR also fixes a bug where, when an input tensor can be sliced, a sliced array view is created. This array view seems to not be writable or to have a different storage from the original tensor, causing incorrect results with the in-place `fill`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95113
Approved by: https://github.com/kulinseth
Summary:
This PR adds a set of uninterpreted data types to PyTorch which can be used to implement experimental functionality out of core (think fp8, int4, int16 quant, etc).
Note: this is a copy-pasta of https://github.com/pytorch/pytorch/pull/89990 with a bug fix for clang9, easier to just to put up another PR since I'm not sure how comandeering works with Meta-only changes.
@bypass-github-export-checks
Test Plan:
```
python test/test_quantization.py -k TestBits
```
Reviewers:
Subscribers:
Tasks:
Tags:
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94992
Approved by: https://github.com/angelayi
Summary: `torch.nn.functional.pixel_shuffle` accepts both float
and quantized inputs. However, previously we would unnecessarily
dequantize quantized inputs into floats before passing them to
the function. This commit fixes this by lowering the pattern
[dequant - pixel_shuffle - quant].
Test Plan:
python test/test_quantization.py TestQuantizeFxOps.test_pixel_shuffle
Reviewers: vkuzo
Subscribers: vkuzo, supriyar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94769
Approved by: https://github.com/vkuzo
Summary:
Original commit changeset: 96a2200d1fd8
Original Phabricator Diff: D43342962
Test Plan: Sandcastle and land castle as well as buck2 build mode/opt //frl/et/projects/Masquerade/stable/datasets/masquerade/c6p7:post_processing
Reviewed By: seemethere, bigfootjon
Differential Revision: D43402398
@bypass-github-export-checks
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95086
Approved by: https://github.com/bigfootjon
Previously, the "can slice" flag in Placeholder constructor in `OperationUtils.mm` is conditioned on whether the numbers of dimensions of base shape and view shape are the same. This doesn't consider the situation that a view tensor could be the base tensor's sliced and then unsqueezed version, resulting in different num of dims.
For example, if we want to stack `y_mps` and `x_mps` on the last dim:
```
t_mps = torch.tensor([1, 2, 3, 4], device="mps")
x_mps = t_mps[2:] # [3, 4]
y_mps = t_mps[:2] # [1, 2]
res_mps = torch.stack((y_mps, x_mps), dim=-1)
```
the kernel will unsqueeze both of them on the last dim and then concatenate them, which is equivalent to:
```
res_mps = torch.cat((y_mps.unsqueeze(-1), x_mps.unsqueeze(-1)), dim=-1)
```
`x_mps.unsqueeze(-1)` is an unsqueezed and contiguous tensor with a storage offset, this kind of tensors should be sliceable without cloning its storage.
Fixes #87856, fixes #91065
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91071
Approved by: https://github.com/kulinseth
Will be needed if one wants to make accurate XFAIL validation
I.e. `torch.backends.mps.is_macos13_or_newer()` will return True if PyTorch is running on MacOS 13.0 or newer, `torch.backends.mps.is_macos13_or_newer(1)` will return True if running on MacOS 13.1 or newer and `torch.backends.mps.is_macos13_or_newer(2)` will return True if running on MacOS 13.2 or newer
Do not use 13.3 check as `@available` does not really work for shared libraries
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95065
Approved by: https://github.com/albanD
This utility allows us to conveniently abstract interpret Sympy expressions with respect to some alternative domain. I am particularly interested in using ValueRanges to do range analysis on expressions (not this PR).
Some minor house-keeping:
* ReferenceAnalysis got moved to its own file, sprouted a constant() implementation, and some uses of math.* got converted to sympy.*
* ValueRangeAnalysis now understands mod
* Test file gets moved from `test_value_ranges.py` to `test_sympy_utils.py`
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94985
Approved by: https://github.com/eellison
Since I didn't want to deal with nondeterministic tests, I went the exhaustive testing route for a fixed list of constants to look at. The tests generate random ranges, propagate the range through the function, and then pick elements in the range and check that the result on the operation is in the resulting range. This caught bugs in log, sqrt and pow.
My resolution for pow was a little special, because I had trouble figuring out the correct semantics under all input domains. Instead, I picked two input domains (pow on two point ranges, and pow where the exponent is known) and only implemented those. Everything else we give up on. I think this is unlikely to affect perf.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94939
Approved by: https://github.com/lezcano, https://github.com/eellison, https://github.com/nunoplopes
The main new invariant is lower/upper must be a Sympy expression of some sort (filtered through `simple_sympify`). There are some simpler sanity checks (mostly making sure the range is well formed). There is a type confusion problem (it's not immediately obvious if a range is for float/int/bool) but we aren't going to solve this for now as it is more complicated.
Billing of changes:
* ValueRanges.wrap() now accepts sympy expressions
* ValueRanges now accepts non-sympy expressions and will sympyify them appropriately. Rewrite calls to ValueRanges to not sympify manually as it is unnecessary
* Don't attempt to test sqrt(-1)
* Add ValuesRanges.unknown() which gives -oo, oo bounds, and rewrite direct calls to -math.inf, math.inf to use it
* Make multiply work between ValueRanges.unknown() and ValueRanges.wrap(0)
* Consistently use sympy.oo instead of math.inf
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94906
Approved by: https://github.com/eellison
I believe this fixes the AllenaiLongformerBase problem in periodic.
The longer version of the problem: we are currently optimistically converting all item() calls into unbacked SymInt/SymFloat, but sometimes this results in a downstream error due to a data-dependent guard. Fallbacks for this case are non-existent; this will just crash the model. This is bad. So we guard this behind a flag until we get working fallbacks.
What could these fallbacks look like? One idea I have is to optimistically make data-dependent calls unbacked, but then if it results in a crash, restart Dynamo analysis with the plan of graph breaking when the item() call immediately happened.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94987
Approved by: https://github.com/Skylion007, https://github.com/malfet
Fix performance bug for `torch.sparse.mm()` with reduce flag.
Found this bug within internal benchmarking.
I made a mistake when updating the previous patch which caused load imbalance between threads.
Tested on the ogbn-products dataset on Xeon CLX with 24 cores:
#### before
```
sparse.mm: mean: 1156.148 ms
sparse.mm: sum: 1163.754 ms
sparse.mm: (using mkl): 703.227 ms
```
#### after
```
sparse.mm: mean: 662.578 ms
sparse.mm: sum: 662.301 ms
sparse.mm: (using mkl): 700.178 ms
```
The result also indicates that the current spmm kernel is no worse than MKL's sparse_mm.
Also update results on `pyg benchmark` with:
```
python gnn.py --use_sage --epochs=3 --runs=1 --inference
```
* Out of box: `13.32s`
* Without the fix in this PR: `5.87s`
* With the fix in this PR: `3.19s`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94969
Approved by: https://github.com/jgong5
Hi!
I've been fuzzing different pytorch modules, and found a few crashes.
The proposed checks fix multiple segmentation faults and heap buffer overflows that were found while fuzzing pytorch with [sydr-fuzz](https://github.com/ispras/oss-sydr-fuzz/tree/master/projects/pytorch).
### Crash files ###
1) Heap buffer overflow that leads to crash
[crash-842314913bf1820ec19cddfbb7400ffdbb756920.zip](https://github.com/pytorch/pytorch/files/9461316/crash-842314913bf1820ec19cddfbb7400ffdbb756920.zip)
```
"AsanReport": [
"==3751==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x619000033478 at pc 0x0000005f9bc3 bp 0x7fffffff1eb0 sp 0x7fffffff1ea8\n",
"READ of size 4 at 0x619000033478 thread T0\n",
"[Detaching after fork from child process 3762]\n",
" #0 0x5f9bc2 in c10::IValue::IValue(c10::IValue&&) /pytorch_fuzz/aten/src/ATen/core/ivalue.h:192:43\n",
" #1 0x9ecd0a7 in torch::jit::pop(std::vector<c10::IValue, std::allocator<c10::IValue> >&) /pytorch_fuzz/aten/src/ATen/core/stack.h:102:12\n",
" #2 0x9ecd0a7 in torch::jit::Unpickler::readInstruction() /pytorch_fuzz/torch/csrc/jit/serialization/unpickler.cpp:380:17\n",
" #3 0x9ecafc7 in torch::jit::Unpickler::run() /pytorch_fuzz/torch/csrc/jit/serialization/unpickler.cpp:226:27\n",
" #4 0x9ecac62 in torch::jit::Unpickler::parse_ivalue() /pytorch_fuzz/torch/csrc/jit/serialization/unpickler.cpp:183:3\n",
" #5 0x9e45996 in torch::jit::unpickle(std::function<unsigned long (char*, unsigned long)>, std::function<c10::StrongTypePtr (c10::QualifiedName const&)>, c10::ArrayRef<at::Tensor>, c10::Type::SingletonOrSharedTypePtr<c10::Type> (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)) /pytorch_fuzz/torch/csrc/jit/serialization/pickle.cpp:127:20\n",
" #6 0x9e4626d in torch::jit::unpickle(char const*, unsigned long, std::function<c10::StrongTypePtr (c10::QualifiedName const&)>, c10::ArrayRef<at::Tensor>, c10::Type::SingletonOrSharedTypePtr<c10::Type> (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)) /pytorch_fuzz/torch/csrc/jit/serialization/pickle.cpp:137:10\n",
```
2) Segmentation fault
[crash-e690c58718e88921350562f0b4d9180938145d77.zip](https://github.com/pytorch/pytorch/files/9461331/crash-e690c58718e88921350562f0b4d9180938145d77.zip)
```
"AsanReport": [
"==3744==ERROR: AddressSanitizer: SEGV on unknown address (pc 0x000009122754 bp 0x7fffffff5290 sp 0x7fffffff5270 T0)\n",
"==3744==The signal is caused by a READ memory access.\n",
"==3744==Hint: this fault was caused by a dereference of a high value address (see register values below). Disassemble the provided pc to learn which register was used.\n",
"[Detaching after fork from child process 3763]\n",
" #0 0x9122754 in c10::intrusive_ptr<torch::jit::Tree, c10::detail::intrusive_target_default_null_type<torch::jit::Tree> >::retain_() /pytorch_fuzz/c10/util/intrusive_ptr.h:269:54\n",
" #1 0x9127929 in c10::intrusive_ptr<torch::jit::Tree, c10::detail::intrusive_target_default_null_type<torch::jit::Tree> >::intrusive_ptr(c10::intrusive_ptr<torch::jit::Tree, c10::detail::intrusive_target_default_null_type<torch::jit::Tree> > const&) /pytorch_fuzz/c10/util/intrusive_ptr.h:352:5\n",
" #2 0x9127929 in torch::jit::Expr::Expr(c10::intrusive_ptr<torch::jit::Tree, c10::detail::intrusive_target_default_null_type<torch::jit::Tree> > const&) /pytorch_fuzz/torch/csrc/jit/frontend/tree_views.h:269:49\n",
" #3 0x91b1bbb in torch::jit::Maybe<torch::jit::Expr>::get() const /pytorch_fuzz/torch/csrc/jit/frontend/tree_views.h:211:12\n",
" #4 0x92a8f74 in torch::jit::ScriptTypeParser::parseClassConstant(torch::jit::Assign const&) /pytorch_fuzz/torch/csrc/jit/frontend/script_type_parser.cpp:461:41\n",
" #5 0x9e1c09b in torch::jit::SourceImporterImpl::importClass(c10::QualifiedName const&, torch::jit::ClassDef const&, bool) /pytorch_fuzz/torch/csrc/jit/serialization/import_source.cpp:549:34\n",
" #6 0x9e13f00 in torch::jit::SourceImporterImpl::importNamedType(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, torch::jit::ClassDef const&) /pytorch_fuzz/torch/csrc/jit/serialization/import_source.cpp:288:5\n",
" #7 0x9e11fbc in torch::jit::SourceImporterImpl::findNamedType(c10::QualifiedName const&) /pytorch_fuzz/torch/csrc/jit/serialization/import_source.cpp:140:5\n",
```
3) Unhandled out of bounds access in a vector
[crash-ccd524e7ba19a37982dd91e0d6fc06bb26dd0b10.zip](https://github.com/pytorch/pytorch/files/9461367/crash-ccd524e7ba19a37982dd91e0d6fc06bb26dd0b10.zip)
```
"AsanReport": [
"==3792== ERROR: libFuzzer: deadly signal\n",
"[Detaching after fork from child process 3809]\n",
" #0 0x59cc11 in __sanitizer_print_stack_trace /llvm-project/compiler-rt/lib/asan/asan_stack.cpp:87:3\n",
" #1 0x511547 in fuzzer::PrintStackTrace() /llvm-project/compiler-rt/lib/fuzzer/FuzzerUtil.cpp:210:5\n",
" #2 0x4f7753 in fuzzer::Fuzzer::CrashCallback() /llvm-project/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:233:3\n",
" #3 0x7ffff7c6741f (/lib/x86_64-linux-gnu/libpthread.so.0+0x1441f)\n",
" #4 0x7ffff7a8700a in __libc_signal_restore_set /build/glibc-SzIz7B/glibc-2.31/signal/../sysdeps/unix/sysv/linux/internal-signals.h:86:3\n",
" #5 0x7ffff7a8700a in raise /build/glibc-SzIz7B/glibc-2.31/signal/../sysdeps/unix/sysv/linux/raise.c:48:3\n",
" #6 0x7ffff7a66858 in abort /build/glibc-SzIz7B/glibc-2.31/stdlib/abort.c:79:7\n",
" #7 0x7ffff7e73910 (/lib/x86_64-linux-gnu/libstdc++.so.6+0x9e910)\n",
" #8 0x7ffff7e7f38b (/lib/x86_64-linux-gnu/libstdc++.so.6+0xaa38b)\n",
" #9 0x7ffff7e7f3f6 in std::terminate() (/lib/x86_64-linux-gnu/libstdc++.so.6+0xaa3f6)\n",
" #10 0x7ffff7e7f6a8 in __cxa_throw (/lib/x86_64-linux-gnu/libstdc++.so.6+0xaa6a8)\n",
" #11 0x7ffff7e763aa (/lib/x86_64-linux-gnu/libstdc++.so.6+0xa13aa)\n",
" #12 0x6aeedf in std::vector<c10::IValue, std::allocator<c10::IValue> >::_M_range_check(unsigned long) const /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/stl_vector.h:1073:4\n",
" #13 0x9ecd66c in torch::jit::Unpickler::readInstruction() /pytorch_fuzz/torch/csrc/jit/serialization/unpickler.cpp\n",
" #14 0x9ecafc7 in torch::jit::Unpickler::run() /pytorch_fuzz/torch/csrc/jit/serialization/unpickler.cpp:226:27\n",
" #15 0x9ecac62 in torch::jit::Unpickler::parse_ivalue() /pytorch_fuzz/torch/csrc/jit/serialization/unpickler.cpp:183:3\n",
```
Some other crashes found by fuzzer:
[crash-0cab888cbd1e9fea92ab6ddeadf40b958b87d62b.zip](https://github.com/pytorch/pytorch/files/9461406/crash-0cab888cbd1e9fea92ab6ddeadf40b958b87d62b.zip)
[crash-04c9ba8e3b0f15028fd0fb0ed014fd352e182a1d.zip](https://github.com/pytorch/pytorch/files/9461407/crash-04c9ba8e3b0f15028fd0fb0ed014fd352e182a1d.zip)
[crash-422ad8c3a3472980ba751f4c7f79cf2b53e49927.zip](https://github.com/pytorch/pytorch/files/9461408/crash-422ad8c3a3472980ba751f4c7f79cf2b53e49927.zip)
### How to reproduce ###
1. To reproduce the crashes, use provided docker: [Dockerfile](https://github.com/ispras/oss-sydr-fuzz/blob/master/projects/pytorch/Dockerfile)
2. Build the container: `docker build -t oss-sydr-fuzz-pytorch-reproduce .`
3. Copy crash file to the current directory
4. Run the container: `` docker run --privileged --network host -v `pwd`:/homedir --rm -it oss-sydr-fuzz-pytorch-reproduce /bin/bash ``
5. And execute fuzz-targets with provided crash-files.
After execution completes you will see ASAN reports.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94815
Approved by: https://github.com/davidberard98
```
GuardOnDataDependentSymNode: It appears that you're trying to get a value out of symbolic int/float whose value is data-dependent (and thus we do not know the true value.) The expression we were trying to evaluate is Eq(i3, -1). Scroll up to see where each of these data-dependent accesses originally occurred.
While executing %as_strided : [#users=1] = call_method[target=as_strided](args = (%pad,), kwargs = {size: (12, %add, 768, 64), stride: (%getitem, %mul, %getitem_1, %getitem_2)})
Original traceback:
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/transformers/models/longformer/modeling_longformer.py", line 928, in <graph break in _sliding_chunks_matmul_attn_probs_value>
chunked_value = padded_value.as_strided(size=chunked_value_size, stride=chunked_value_stride)
```
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94986
Approved by: https://github.com/albanD
With this change, expected failures will be correctly reported as such by pytest (instead of passes as before).
It was sometimes a little confusing to see operators you did not expect to work in inductor reported as passing their tests.
One downside is that expected failures/skips for test variants now have to be identified by tuples, i.e., `("max", "reduction_no_dim"): {f16}` instead of just `"max.reduction_no_dim": {f16}`. It seems to me it is worth it.
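For illustration, a hedged sketch of what a tuple-keyed entry can look like (the dictionary name here is assumed for the example, not taken from this PR; `f16` is shorthand for `torch.float16`):
```python
import torch

# Skips/xfails keyed by (op name, variant name) tuples instead of "op.variant" strings.
inductor_expected_failures = {
    ("max", "reduction_no_dim"): {torch.float16},
}
```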
This change would also allow simplifying the `TestInductorOpInfo` class a little, since it no longer has to handle the skips/xfails, but that might require dropping support for things like `PYTORCH_COLLECT_EXPECT` and `PYTORCH_FAIL_ON_SUCCESS`, so I didn't do it.
A couple of other minor changes:
- Got rid of c32, c64, c128 in torchinductor_opinfo. We don't support complex numbers, so they shouldn't be necessary.
- Renamed TestExpect Enum to ExpectedTestResult to get rid of a pytest warning that thinks it is a class that has tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94813
Approved by: https://github.com/lezcano, https://github.com/jansel
This PR removes the unnecessary == 0 guard when constructing empty tensors, by ensuring that when we create a contiguous tensor we go directly to the C++ torch.empty implementation (instead of indirecting through empty_strided), where we can bypass doing zero tests when computing the size of the storage. This probably also speeds up trace time.
When I did this, I found out that `empty_tensor_restride_symint` was flagrantly wrong (we had never exercised it before because we redirected to `empty_strided` in PrimTorch decomp, which doesn't hit this codepath.) The bugs:
* Stride computation was wrong (only `last_idx` was ever written to); see the sketch after this list
* Using set_sizes_and_strides with `sym_sizes` input doesn't work, because there is some sort of ordering problem where `clone_symvec` isn't safe when you clone a vector into itself. Probably should fix this.
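As a hedged, plain-Python sketch of the stride computation that `empty_tensor_restride_symint` is meant to perform (not the actual C++ code, and ignoring the special-casing of zero- and one-sized dims):
```python
def contiguous_strides(sizes):
    # Row-major contiguous strides: the innermost dim has stride 1 and every
    # earlier dim's stride is the product of all later sizes.
    strides = [1] * len(sizes)
    acc = 1
    for i in reversed(range(len(sizes))):
        strides[i] = acc
        acc *= sizes[i]
    return strides

assert contiguous_strides([2, 3, 4]) == [12, 4, 1]
```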
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94512
Approved by: https://github.com/ngimel
This is the main payload of this diff stack. With it, we are able to construct a 1D tensor from unbacked SymInt with guards that are equivalent to asserting that the size is non-negative (which makes sense!) To get here, I had to arrange for all of the guards that occur when doing contiguity tests to be lazy. This was done by writing non-branching implementations of each of the tests in `sympy_is_contiguous` etc functions, and then using those implementations when we don't branch.
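For intuition, a hedged plain-Python analogue of the non-branching idea (the real `sympy_is_contiguous`-style functions build SymPy expressions rather than Python booleans; the function name below is illustrative):
```python
def is_contiguous_branch_free(sizes, strides):
    # Accumulate one boolean expression with non-short-circuiting operators
    # instead of early-returning, so the same structure could be evaluated
    # over symbolic values without introducing guards.
    result = True
    expected_stride = 1
    for size, stride in zip(reversed(sizes), reversed(strides)):
        result = result & ((size == 1) | (stride == expected_stride))
        expected_stride = expected_stride * size
    return result

assert is_contiguous_branch_free([2, 3], [3, 1]) is True
```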
I also had to do some bug fixes for `is_non_overlapping_and_dense`, as unbacked SymInts were very untested previously (and that was the only time you would actually hit the Python version of the code.) In particular, we now consistently pass separate sizes/strides lists into each of the boolean computation functions (and only pack them into a single argument list when going to Sympy, which doesn't support lists of variables in custom functions.)
Finally, to actually test that this is doing something, I add a simple assumptions system from https://github.com/pytorch/pytorch/pull/90985 and use this to get the end to end test test_item_to_constructor passing. Soon, I intend to replace this with a range analysis system which will be used for assumptions in the short term. (We still might use Z3, but for all the stray assumptions I've seen range analysis will be good enough.)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94473
Approved by: https://github.com/albanD
Currently, when unrolling an nn.Sequential, we use an integer to represent each submodule's name. This produces some difficulty in tracking the origin of the parameters in the export path:
```python
import torch.nn as nn
from collections import OrderedDict

model = nn.Sequential(OrderedDict([
    ('conv1', nn.Conv2d(1, 20, 5)),
    ('relu1', nn.ReLU()),
    ('conv2', nn.Conv2d(20, 64, 5)),
    ('relu2', nn.ReLU())
]))
```
Currently, the submodules will have names such as model.0 and model.1 instead of model.conv1 and model.relu1. This discrepancy makes it difficult to track the origin of parameters, because they are represented as model.conv1.foo and model.relu1.foo in model.named_parameters().
We replace enumerate() with named_children() to keep the submodules' names.
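A hedged illustration of the difference, using the `model` defined above:
```python
# enumerate() only yields positional indices, losing the user-given names ...
print([str(idx) for idx, _ in enumerate(model)])       # ['0', '1', '2', '3']
# ... while named_children() preserves them.
print([name for name, _ in model.named_children()])    # ['conv1', 'relu1', 'conv2', 'relu2']
```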
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94913
Approved by: https://github.com/jansel
**Summary**: torch.nn.Module implementations previously did not support custom implementations of `__getattr__`; if a torch.nn.Module subclass implemented `__getattr__` and we tried to access an attribute that was expected to be present in `__getattr__`, dynamo would not check `__getattr__` and would error out with an AttributeError. This PR copies the functionality from UserDefinedObjectVariable into torch.nn.Module so that it also supports `__getattr__`
Example of a module which previously would fail:
```python
import torch

class MyMod(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.custom_dict = {"queue": [torch.rand((2, 2)) for _ in range(3)]}
        self.other_attr = torch.rand((2, 2))

    def __getattr__(self, name):
        custom_dict = self.custom_dict
        if name in custom_dict:
            return custom_dict[name]
        return super().__getattr__(name)

    def forward(self, x):
        return x @ self.other_attr + self.queue[-1]
```
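A hedged usage sketch (not part of the original PR description): with this change, compiling such a module should no longer raise an AttributeError when `forward` reaches `self.queue` through `__getattr__`. `torch.compile` is used here simply as one way to invoke dynamo:
```python
mod = MyMod()
compiled = torch.compile(mod)
out = compiled(torch.rand((2, 2)))  # previously failed under dynamo with an AttributeError
```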
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94658
Approved by: https://github.com/yanboliang, https://github.com/jansel
Fixes backward pass for bilinear.
Summary of changes:
- bilinear op is able to produce **contiguous, non-view** tensors with a storage offset, such as: shape=`[1, 1, 1, 1]`, `storage_offset=12`. This seems like a weird case, but it is valid, and for these types of tensors we wouldn't be able to gather/scatter since we look at the view flag (which is not set here). This change looks at `storage_offset` only rather than the is_view flag, which is not being set.
- **reduction sum** must return a zeroed-out output if passed an input with 0 elements (e.g. a shape of (0, 5)); see the example below.
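A hedged example of the expected reduction-sum behavior on an input with zero elements (standard PyTorch semantics, shown on CPU):
```python
import torch

x = torch.empty(0, 5)
print(torch.sum(x, dim=0))  # tensor([0., 0., 0., 0., 0.]) -- zeros, not uninitialized values
```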
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94892
Approved by: https://github.com/kulinseth
Summary:
update the hashing method for the `ChoiceCaller` class.
`TritonTemplateCaller` objects will now be hashed to:
`{name}-({BLOCK_M}, {BLOCK_N}, {BLOCK_K})-{num_stages}-{num_warps}-{code_hash}`
for example:
`triton_mm-(64, 32, 32)-4-8-cptlntwzcl2gaaofd2oabdwhaqv4ox3lluvbuxitjfhhpz6cyl4o`
`ExternKernelCaller` objects will now be hashed to:
`{name}-{kwargs.keys()[0]}={kwargs.vals()[0]}-...-{code_hash}`
for example:
`addmm-alpha=1-beta=1-c4xxd3iocu4yt6z4udrlqnumays7q6mfnfd3qprh4fxgsvyhqdkf`
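A hedged sketch of how such keys could be assembled (the helper names are illustrative, not the actual inductor code):
```python
def triton_choice_key(name, block_m, block_n, block_k, num_stages, num_warps, code_hash):
    # e.g. "triton_mm-(64, 32, 32)-4-8-<code_hash>"
    return f"{name}-({block_m}, {block_n}, {block_k})-{num_stages}-{num_warps}-{code_hash}"

def extern_kernel_choice_key(name, kwargs, code_hash):
    # e.g. "addmm-alpha=1-beta=1-<code_hash>"
    joined = "-".join(f"{k}={v}" for k, v in kwargs.items())
    return f"{name}-{joined}-{code_hash}"
```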
Test Plan: sandcastle
Differential Revision: D43285470
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94853
Approved by: https://github.com/jansel, https://github.com/bertmaher
- add _mps_convolution_impl that takes optional shape
- for conv_transpose2d grad, use the shape from the forward pass directly
- for conv, calculate the shape from input
- remove nn.functional.conv_transpose2d grad from blocklist
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94871
Approved by: https://github.com/kulinseth
The basic idea behind this PR is that we want to continue using the guarding implementations of contiguity tests, if all of the elements are backed (aka, have hints). If they don't have hints, we'll have to do something slower (use the non-short-circuiting, non-guarding implementations of contiguity), but most of the time you aren't dealing with unbacked SymInts.
So this PR has three parts.
1. We expose `has_hint` on `SymNode`. This allows us to query whether or not a SymInt is backed or not from C++. Fairly self explanatory. Will require LTC/XLA updates; but for backends that don't support unbacked SymInts you can just always return true.
2. We update `compute_non_overlapping_and_dense` to test if the inputs are hinted. If they are all hinted, we use the conventional C++ implementation. Otherwise we call into Python (see the sketch after this list). The Python case is not heavily tested right now because I haven't gotten all of the pieces for unbacked SymInts working yet. Coming soon.
3. We add stubs for all of the other contiguity tests. The intention is to apply the same treatment to them as well, but this is not wired up yet for safety reasons.
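A minimal sketch of the dispatch described in item 2 (the C++/Python implementations are passed in as parameters to keep the sketch self-contained; only `SymNode.has_hint()` comes from this PR):
```python
def has_hint(x):
    # Plain ints always have a hint; SymInts expose has_hint() on their SymNode.
    return isinstance(x, int) or x.node.has_hint()

def compute_non_overlapping_and_dense(sizes, strides, cpp_impl, python_impl):
    # Fast path: everything is hinted, so the guarding C++ implementation is safe.
    if all(has_hint(x) for x in list(sizes) + list(strides)):
        return cpp_impl(sizes, strides)
    # Slow path: at least one unbacked SymInt -> non-guarding Python fallback.
    return python_impl(sizes, strides)
```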
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94431
Approved by: https://github.com/voznesenskym
I don't think the docstring explaining `pin_memory_device` is very clear. If it weren't for the string type, I would not have guessed that this was about the device that is referred to in the `pin_memory` option (and honestly, it took me a few minutes before noticing the type).
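For context, a hedged usage sketch of the option in question (pinning only takes effect on a CUDA-enabled build):
```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(8, 3))
# pin_memory_device names the device whose pinned (page-locked) host memory is
# used when pin_memory=True.
loader = DataLoader(dataset, batch_size=4, pin_memory=True, pin_memory_device="cuda")
```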
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94349
Approved by: https://github.com/ejguan
Summary: Add support for "height" and "width" dimension for the "select" operator on pytorch vulkan backend.
Test Plan:
```
yipjustin@yipjustin-mbp fbsource % buck run -c pt.vulkan_full_precision=1 --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -- --gtest_filter="*select_3d*"
Downloaded 1/2 artifacts, 1.29 Mbytes, 0.0% cache miss (for updated rules)
Building: finished in 3.7 sec (100%) 450/450 jobs, 2/450 updated
Total time: 3.8 sec
BUILD SUCCEEDED
Running main() from xplat/third-party/gmock/googletest-1.12.1/googletest/src/gtest_main.cc
Note: Google Test filter = *select_3d*
[==========] Running 9 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 9 tests from VulkanAPITest
[ RUN ] VulkanAPITest.select_3d_depth_small
[ OK ] VulkanAPITest.select_3d_depth_small (30 ms)
[ RUN ] VulkanAPITest.select_3d_depth_medium
[ OK ] VulkanAPITest.select_3d_depth_medium (0 ms)
[ RUN ] VulkanAPITest.select_3d_depth_large
[ OK ] VulkanAPITest.select_3d_depth_large (1 ms)
[ RUN ] VulkanAPITest.select_3d_height_small
[ OK ] VulkanAPITest.select_3d_height_small (0 ms)
[ RUN ] VulkanAPITest.select_3d_height_medium
[ OK ] VulkanAPITest.select_3d_height_medium (0 ms)
[ RUN ] VulkanAPITest.select_3d_height_large
[ OK ] VulkanAPITest.select_3d_height_large (3 ms)
[ RUN ] VulkanAPITest.select_3d_width_small
[ OK ] VulkanAPITest.select_3d_width_small (0 ms)
[ RUN ] VulkanAPITest.select_3d_width_medium
[ OK ] VulkanAPITest.select_3d_width_medium (0 ms)
[ RUN ] VulkanAPITest.select_3d_width_large
[ OK ] VulkanAPITest.select_3d_width_large (1 ms)
[----------] 9 tests from VulkanAPITest (40 ms total)
[----------] Global test environment tear-down
[==========] 9 tests from 1 test suite ran. (40 ms total)
[ PASSED ] 9 tests.
```
Reviewed By: SS-JIA
Differential Revision: D43020796
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94612
Approved by: https://github.com/SS-JIA
- Backward pass has to pass an explicit bias tensor of zeros if none is given to the op, or the bias gradient will not be calculated.
- Fixed the bias tensor mistakenly getting overwritten to zeros
- Fixes a crash when the lstm op is called with has_biases set to false. The change takes into account the changed shape of the input params TensorList depending on the bias flag.
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94889
Approved by: https://github.com/DenisVieriu97
```diff
 echo NOTE: To run `import torch`, please make sure to activate the conda environment by running `call %CONDA_PARENT_DIR%\Miniconda3\Scripts\activate.bat %CONDA_PARENT_DIR%\Miniconda3` in Command Prompt before running Git Bash.
@@ -106,7 +106,7 @@ All binaries are built in CircleCI workflows except Windows. There are checked-i
 Some quick vocab:
-* A \**workflow** is a CircleCI concept; it is a DAG of '**jobs**'. ctrl-f 'workflows' on https://github.com/pytorch/pytorch/blob/master/.circleci/config.yml to see the workflows.
+* A \**workflow** is a CircleCI concept; it is a DAG of '**jobs**'. ctrl-f 'workflows' on https://github.com/pytorch/pytorch/blob/main/.circleci/config.yml to see the workflows.
 * **jobs** are a sequence of '**steps**'
 * **steps** are usually just a bash script or a builtin CircleCI command. *All steps run in new environments, environment variables declared in one script DO NOT persist to following steps*
 * CircleCI has a **workspace**, which is essentially a cache between steps of the *same job* in which you can store artifacts between steps.
@@ -117,8 +117,8 @@ The nightly binaries have 3 workflows. We have one job (actually 3 jobs: build,
 3. For each binary configuration, e.g. linux_conda_3.7_cpu there is a
 1. smoke_linux_conda_3.7_cpu
 1. Downloads the package from the cloud, e.g. using the official pip or conda instructions
@@ -146,26 +146,26 @@ The nightly binaries have 3 workflows. We have one job (actually 3 jobs: build,
 ## How are the jobs structured?
-The jobs are in https://github.com/pytorch/pytorch/tree/master/.circleci/verbatim-sources. Jobs are made of multiple steps. There are some shared steps used by all the binaries/smokes. Steps of these jobs are all delegated to scripts in https://github.com/pytorch/pytorch/tree/master/.circleci/scripts .
+The jobs are in https://github.com/pytorch/pytorch/tree/main/.circleci/verbatim-sources. Jobs are made of multiple steps. There are some shared steps used by all the binaries/smokes. Steps of these jobs are all delegated to scripts in https://github.com/pytorch/pytorch/tree/main/.circleci/scripts .
-* Linux jobs: https://github.com/pytorch/pytorch/blob/master/.circleci/verbatim-sources/linux-binary-build-defaults.yml
+* Linux jobs: https://github.com/pytorch/pytorch/blob/main/.circleci/verbatim-sources/linux-binary-build-defaults.yml
-* Common shared code (shared across linux and macos): https://github.com/pytorch/pytorch/blob/master/.circleci/verbatim-sources/nightly-binary-build-defaults.yml
+* Common shared code (shared across linux and macos): https://github.com/pytorch/pytorch/blob/main/.circleci/verbatim-sources/nightly-binary-build-defaults.yml
 * binary_checkout.sh - checks out pytorch/builder repo. Right now this also checks out pytorch/pytorch, but it shouldn't. pytorch/pytorch should just be shared through the workspace. This can handle being run before binary_populate_env.sh
 * binary_populate_env.sh - parses BUILD_ENVIRONMENT into the separate env variables that make up a binary configuration. Also sets lots of default values, the date, the version strings, the location of folders in s3, all sorts of things. This generally has to be run before other steps.
 * binary_install_miniconda.sh - Installs miniconda, cross platform. Also hacks this for the update_binary_sizes job that doesn't have the right env variables
@@ -308,7 +308,7 @@ Note that the Windows Python wheels are still built in conda environments. Some
 * These should all be consolidated
 * These must run on all OS types: MacOS, Linux, and Windows
-* These all run smoke tests at the moment. They inspect the packages some, maybe run a few import statements. They DO NOT run the python tests nor the cpp tests. The idea is that python tests on master and PR merges will catch all breakages. All these tests have to do is make sure the special binary machinery didn’t mess anything up.
+* These all run smoke tests at the moment. They inspect the packages some, maybe run a few import statements. They DO NOT run the python tests nor the cpp tests. The idea is that python tests on main and PR merges will catch all breakages. All these tests have to do is make sure the special binary machinery didn’t mess anything up.
 * There are separate run_tests.sh and smoke_test.sh because one used to be called by the smoke jobs and one used to be called by the binary test jobs (see circleci structure section above). This is still true actually, but these could be united into a single script that runs these checks, given an installed pytorch package.
 ### Note on libtorch
@@ -340,7 +340,7 @@ The Dockerfiles are available in pytorch/builder, but there is no circleci job o
 tl;dr make a PR that looks like https://github.com/pytorch/pytorch/pull/21159
-Sometimes we want to push a change to master and then rebuild all of today's binaries after that change. As of May 30, 2019 there isn't a way to manually run a workflow in the UI. You can manually re-run a workflow, but it will use the exact same git commits as the first run and will not include any changes. So we have to make a PR and then force circleci to run the binary workflow instead of the normal tests. The above PR is an example of how to do this; essentially you copy-paste the binarybuilds workflow steps into the default workflow steps. If you need to point the builder repo to a different commit then you'd need to change https://github.com/pytorch/pytorch/blob/master/.circleci/scripts/binary_checkout.sh#L42-L45 to checkout what you want.
+Sometimes we want to push a change to main and then rebuild all of today's binaries after that change. As of May 30, 2019 there isn't a way to manually run a workflow in the UI. You can manually re-run a workflow, but it will use the exact same git commits as the first run and will not include any changes. So we have to make a PR and then force circleci to run the binary workflow instead of the normal tests. The above PR is an example of how to do this; essentially you copy-paste the binarybuilds workflow steps into the default workflow steps. If you need to point the builder repo to a different commit then you'd need to change https://github.com/pytorch/pytorch/blob/main/.circleci/scripts/binary_checkout.sh#L42-L45 to checkout what you want.
 ## How to test changes to the binaries via .circleci
```
```jsonc
	// Features to add to the dev container. More info: https://containers.dev/features.
	"features": {
		// This is needed for lintrunner
		"ghcr.io/devcontainers/features/rust:1": {}
	}

	// Uncomment to connect as root instead. More info: https://aka.ms/dev-containers-non-root.
	// "remoteUser": "root"
}
```