Commit Graph

61538 Commits

Author SHA1 Message Date
1a8af1503f Upgrade Pybind submodule to 2.10.4 (#103989)
This is not ready for review, this is to make sure asan is fixed.
Not sure what is the most effective way to track down the bad dec_ref within deploy yet.

The asan silencing is done to match this comment:
1c79003b3c/test/test_cpp_extensions_jit.py (L749-L752)

EDIT: since the final failing function is in libtorch_python.so, we would need to skip that whole lib (not ok). So now we're skipping based on the function name which should be restrictive enough to not hide any real bug.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103989
Approved by: https://github.com/malfet
2023-06-27 20:22:39 +00:00
c98896b76f [quant][pt2e] Add more precise representation for quantized add (#104130)
Summary:
The planned e2e for quantization in pytorch 2.0 export is the following:

float_model -> prepare_pt2e -> calibration -> convert_pt2e -> ...

inside convert_pt2e, we will first produce a q/dq representation of the quantized model, similar to the previous output of
convert_to_reference_fx in fx graph mode quantization:

```
torch.ops.quantized_decomposed.dequantize_per_tensor -> torch.ops.aten.add -> torch.ops.quantized_decomposed.quantize_per_tensor
torch.ops.quantized_decomposed.dequantize_per_tensor   /
```

Then we'll rewrite the above to a representation that expresses the intent more precisely: here we actually
want to do int8 addition rather than simulate it with fp32 operations. The representation for
quantized add is:

```
def quantized_add(x_i8, x_scale, x_zero_point, y_i8, y_scale, y_zero_point, out_scale, out_zero_point):
    x = (x_scale / out_scale) * x_i8
    y = (y_scale / out_scale) * y_i8
    out = x + y
    out -= (x_zero_point * x_scale + y_zero_point * y_scale) / out_scale
    out += out_zero_point
    return out
```
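
For reference, the q/dq chain above computes the following (a minimal sketch in plain PyTorch using the standard affine quantization scheme; the helper names are illustrative and the decomposed ops are not called directly):

```python
import torch

def dequant(q_i8, scale, zero_point):
    # standard affine dequantization: (q - zp) * scale
    return (q_i8.to(torch.float32) - zero_point) * scale

def quant(x_fp32, scale, zero_point, qmin=-128, qmax=127):
    # standard affine quantization: round, shift by zero point, clamp to int8 range
    q = torch.round(x_fp32 / scale) + zero_point
    return torch.clamp(q, qmin, qmax).to(torch.int8)

def reference_quantized_add(x_i8, x_scale, x_zp, y_i8, y_scale, y_zp, out_scale, out_zp):
    # what dequantize_per_tensor -> aten.add -> quantize_per_tensor computes
    out_fp32 = dequant(x_i8, x_scale, x_zp) + dequant(y_i8, y_scale, y_zp)
    return quant(out_fp32, out_scale, out_zp)
```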

Test Plan:
```
buck2 test caffe2/test:quantization_pt2e -- --exact 'caffe2/test:quantization_pt2e - test_representation_add (quantization.pt2e.test_quantize_pt2e.TestQuantizePT2E)'
```

Reviewed By: kimishpatel

Differential Revision: D45628032

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104130
Approved by: https://github.com/kimishpatel
2023-06-27 20:11:30 +00:00
7bf27cf163 [Inductor][FX passes] Remove config.split_cat_fx_passes & Add config.experimental_patterns (#104208)
Summary:
TLDR:
* Remove config.split_cat_fx_passes, and move the split/cat passes behind config.pattern_matcher (True by default).
* Add config.experimental_patterns (False by default).
* In the future, general/universal patterns should go behind config.pattern_matcher; customized/immature patterns should go behind config.experimental_patterns.

More details at:
https://docs.google.com/document/d/1P8uJTpOTdQpUbw56UxHol40tt-EPFTq1Qu38072E9aM/edit
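
As a rough illustration of how the resulting flags would be toggled (a sketch; the flag names follow the TLDR above and the final API may differ):

```python
import torch._inductor.config as inductor_config

# general/universal fusion patterns (split/cat passes now live here); True by default
inductor_config.pattern_matcher = True
# customized / not-yet-mature patterns stay opt-in; False by default
inductor_config.experimental_patterns = False
```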

Test Plan: Existing unit tests

Reviewed By: jansel, jackiexu1992

Differential Revision: D46752606

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104208
Approved by: https://github.com/williamwen42
2023-06-27 20:08:40 +00:00
2da6cae43c [core][pruning][sparse][feature] SparseSemiStructured tensor subclass (#102135)
This PR adds in support for semi-structured sparsity via a tensor
subclass. It currently uses the CUTLASS kernels merged in PR #100881.

In the future we plan to add in cuSPARSELt support (see the other PRs in
the stack), which will give us larger performance gains.

This PR adds in 2 things:
- a Tensor subclass, `SparseSemiStructuredTensor` to store the
  sparse tensor in compressed form and override `__torch_dispatch__`.
- a conversion function that takes in a dense tensor and a
  semi-structured sparse bool mask and creates an instance of the
  subclass.

**SparseSemiStructuredTensor**

The subclass stores the dense tensor in a contiguous flattened tensor
for future compatibility with cuSPARSELt, which expects this format.
Note that the CUTLASS kernels do not have this limitation, as the
specified values and the metadata are passed separately in
`_structured_sparse_linear`. In the future we can use the cuSPARSELt bindings
[here](https://github.com/pytorch/pytorch/pull/103700) for faster matmul, better dtype coverage, and relaxed shape
constraints.

Since we currently don't have a way to go back from the sparse
representation to the dense representation, and we store the weights in
compressed form, we don't have a great way to handle .t().

Instead, we keep track of how often we've called transpose on our
tensor, and if it's an unexpected number we throw an error. When the first
argument is sparse, we expect an even number of calls to transpose,
while when the second argument is sparse, we expect an odd number of
calls. This is because we support second argument sparse matrix
multiplications by using transpose properties.
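
The transpose trick relied on here is the identity `(A @ B) == (B.t() @ A.t()).t()`, which lets a kernel that only accepts a sparse first operand also serve the sparse-second-operand case. A quick dense sanity check of that identity (illustrative only, not the subclass code):

```python
import torch

A = torch.randn(8, 16)
B = torch.randn(16, 4)
# mm with a "sparse" second argument can be rewritten so that operand appears
# first, at the cost of two extra transposes
assert torch.allclose(A @ B, (B.t() @ A.t()).t(), atol=1e-6)
```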

**to_sparse_semi_structured**

This is a conversion function to convert a dense tensor and a
semi-structured sparse bool mask into a subclass. Currently, we must
pass in a bool mask, since we can't infer it because there may be
additional zero elements in the dense tensor, so `tensor != 0` is not necessarily 2:4
sparse.
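
For context, 2:4 (semi-structured) sparsity means each contiguous group of four elements along the last dimension has at most two non-zeros, and a dense tensor may contain extra zeros, which is why `tensor != 0` cannot be trusted. A small illustrative check (a sketch, not part of this PR):

```python
import torch

def is_24_sparse(mask: torch.Tensor) -> bool:
    # mask: bool tensor whose last dim is a multiple of 4; True where a value is kept
    groups = mask.reshape(-1, 4)
    return bool((groups.sum(dim=-1) <= 2).all())

mask = torch.tensor([0, 0, 1, 1]).bool().tile(128, 32)
print(is_24_sparse(mask))  # True: exactly 2 non-zeros in every group of 4
```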

Once we add either a method to derive the mask from the dense tensor or
cuSPARSELt, we no longer need to pass in the mask. cuSPARSELt has its
own helper functions to create the metadata mask.

**User Details**

We have implemented support for the following ops for `torch.float16`
and `torch.int8`:
```
torch.addmm(bias, dense, sparse.t())
torch.mm(dense, sparse)
torch.mm(sparse, dense)
aten.linear.default
aten.t.default
aten.t.detach
```

The end user interface to accelerate an nn.Linear module with the
subclass would look like this:

```
from torch.sparse import to_sparse_semi_structured

mask = torch.Tensor([0, 0, 1, 1]).tile(128, 32).cuda().bool()
linear = Model(128, 128).half().cuda()

linear.weight = nn.Parameter(to_sparse_semi_structured(linear.weight,
                                                       mask=linear.weight.bool()))

```

This also updates tests and the `torch.sparse` module docstring to
reflect these changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102135
Approved by: https://github.com/albanD
2023-06-27 19:21:06 +00:00
39868b0578 [codemod][third-party][gtest] Migrate all fbcode gtest from tp2 to fbsource/third-party (#104255)
Summary:
## What is this?
This is a giant codemod to migrate all of fbcode from the tp2 version of gtest to the `fbsource/third-party` version.

## Why?
Various parts of the monorepo use different versions of gtest which are incompatible with each other and make maintenance of C++ testing more difficult than it should be. There also doesn't seem to be much reason for this fragmentation. Shifting all `gtest` dependencies towards `fbsource/third-party` is a big step in the right direction towards cleaning this up.

Also -- tp2 is deprecated, so we want to stop using that anyway. If we're going to make improvements to `gtest`, we should get away from tp2 as a first step.

## How?

I used bash script to perform the majority of the codemod: P777150295

I followed up with `rg` to find additional dependencies, then simply iterated a ton until CI was (mostly) happy.

This diff also includes an update to autodeps to use the `third-party/fbsource` version of gtest rather than the `tp2` version.

#forcetdhashing

Test Plan: CI

Differential Revision: D46961576

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104255
Approved by: https://github.com/huydhn
2023-06-27 19:10:08 +00:00
a66107a30c [DTensor][Random] Introduce CudaRNGStateTracker to maintain parallel RNG state for DTensor (#103235)
# Change
This PR adds two classes to DTensor:

1. `CudaRNGStateTracker`:  `CudaRNGStateTracker` stores Random Number Generator (RNG) state (a `ByteTensor` object) in a `dict`, mapping from a corresponding tag to each state tensor. It also provides a set of convenient utility methods to help access/modify the state tensors. The most important interface is `_distribute_region` which will be used when DTensor executes a random op (an operator that calls RNG).

2. `OffsetBasedRNGTracker`: This subclass of `CudaRNGStateTracker` defines the default policy of how RNG states should be shared and synchronized among all ranks to respect the semantics of DTensor random operators.
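
A rough sketch of the tracker idea described above (method names and details here are illustrative; the real implementation lives in DTensor and differs):

```python
import contextlib
import torch

class CudaRNGStateTracker:
    """Keeps per-tag CUDA RNG states (ByteTensors) and swaps them in on demand."""

    def __init__(self):
        self._states = {}  # tag -> ByteTensor RNG state

    def add_state(self, tag: str, seed: int):
        orig = torch.cuda.get_rng_state()
        torch.cuda.manual_seed(seed)
        self._states[tag] = torch.cuda.get_rng_state()
        torch.cuda.set_rng_state(orig)

    @contextlib.contextmanager
    def _distribute_region(self, tag: str):
        # run the enclosed random op under the tracked state, then save it back
        orig = torch.cuda.get_rng_state()
        torch.cuda.set_rng_state(self._states[tag])
        try:
            yield
        finally:
            self._states[tag] = torch.cuda.get_rng_state()
            torch.cuda.set_rng_state(orig)
```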

# Warning

- With `Multi-threaded ProcessGroup`, the global variable `_rng_tracker` will be shared among threads (ranks) and cause issues. We need to figure out a compatible solution for that.

- The RNG state may be out of sync on ranks outside the participating ranks. This is harmless in our current submesh use case, though.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103235
Approved by: https://github.com/wanchaol
2023-06-27 19:00:25 +00:00
84f578dcc2 [ONNX] Cache AutoTokenizer in CI for test (#104233)
Fixes #103950

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104233
Approved by: https://github.com/malfet
2023-06-27 18:55:39 +00:00
93b6b17dd0 CUDA_HOST_COMPILER spelling fix in cmake build files generate method (#104126)
Fixes the CUDA_HOST_COMPILER spelling when generating additional build options in the CMake.generate method.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104126
Approved by: https://github.com/malfet
2023-06-27 18:46:12 +00:00
a73ad82c8f conditional CMAKE_CUDA_STANDARD (#104240)
Fixes #104237

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104240
Approved by: https://github.com/malfet
2023-06-27 18:41:25 +00:00
bf34ecd0c8 [RFC]: Integrate assertions functionalization to export (after AOT export) (#103887)
This PR integrates the assertion functionalization logic into the current export logic.

**NOTE:**
I finally decided to do the assertion functionalization after AOT export instead of before for the following reasons:
* The benefit of AOT export is that the graph is already functionalized, so things like method calls are already transformed into function calls. However, if we do it before AOT export, the graph is still at the torch level and extra logic like bab21d20eb/torch/_export/pass_base.py (L201-L204C17) would need to be implemented.
* The graph signature currently becomes incorrect after adding runtime assertions (this doesn't seem to break logic since we already depend on positions instead of FQNs of outputs). This PR also fixes this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103887
Approved by: https://github.com/avikchaudhuri, https://github.com/tugsbayasgalan
2023-06-27 18:14:29 +00:00
936cd4f2f5 Migrate exportdb to torch.export (#104260)
Reapply of (https://github.com/pytorch/pytorch/pull/103861). Things that needed to be fixed:

- Fix a bug with returning dict output type
- Make pass_base work with map implementation
- Fix subtle bug with dynamo not propagating "val" in node.meta
- Add export_constraints field in ExportCase in ExportDB

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104260
Approved by: https://github.com/angelayi
2023-06-27 17:49:18 +00:00
ab9577087a Update accuracy for dynamo/torchbench CI - vision_maskrcnn, hf_T5_generate and dlrm (#104263)
Fixes breaking CI jobs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104263
Approved by: https://github.com/atalman, https://github.com/seemethere
2023-06-27 17:33:01 +00:00
ef285faeba [ET][XNNPACK] Add support for quantized Multiply (#104134)
Summary:
Also adds support for backend_config with relu fusion since XNNPACK allows it.

We should revisit the relu fusion once we gain more clarity on quantSrcPartition or some other way to do these fusions without having to add all combinations.

TODO: we should rename the backend config to et_xnnpack.py or something similar.

Test Plan: `buck test fbcode//mode/dev-nosan fbcode//executorch/backends/xnnpack/test:`

Differential Revision: D46985169

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104134
Approved by: https://github.com/mcr229, https://github.com/salilsdesai
2023-06-27 16:59:28 +00:00
80ea3422f0 [ROCm] Enable tl.reduce usage on ROCm (#104099)
Reverts the explicit aten.prod fallback on ROCm and enables the use of tl.reduce in Triton codegen. This PR also enables an optimisation that was previously conditionalised out for ROCm: https://github.com/pytorch/pytorch/pull/102444

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104099
Approved by: https://github.com/peterbell10, https://github.com/malfet
2023-06-27 16:21:32 +00:00
99e87bb6a0 [MPS] Dispatch outer bin edges selection function (#101792)
Dispatch the selection function to prevent using `is_mps()` in `Histogram.cpp`.

<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at b329a02</samp>

This pull request refactors and implements the logic for inferring the bin edges of histograms from the input tensor for different device types. It introduces a dispatch stub `histogram_select_outer_bin_edges_stub` and moves the device-specific code to separate files, such as `HistogramKernel.cpp` and `HistogramKernel.mm`. This improves the modularity and readability of the histogram functions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101792
Approved by: https://github.com/albanD
2023-06-27 16:17:10 +00:00
217a8b4697 [MPS] Add MPSProfiler to histogram kernel (#101692)
Apart from introducing MPSProfiler, this PR also
1. removes the synchronization call after all the commands are encoded, since the stream will be synchronized when the next graph op is encountered and run. One can take a look at this [PR](https://github.com/pytorch/pytorch/pull/99810) to get some insight.
2. initializes the offset calculation kernel's thread output with 0 to ensure the subsequent offset accumulation is correct. This change aligns the kernel with the `kernel_index_offsets` kernel.

<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 4094984</samp>

This change enables performance analysis of the `histogram` kernel on MPS devices by using the `MPSProfiler` class to collect and report relevant metrics. It modifies the file `HistogramKernel.mm` to add profiling calls around the kernel execution.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101692
Approved by: https://github.com/albanD
2023-06-27 16:17:10 +00:00
c40f5edf7b Change tools search order (#104214)
Prevents the following cryptic error if one attempts to use `run_test.py` on a system that also has torchaudio installed in dev mode (as `tools` from https://github.com/pytorch/audio might take precedence, but this is not how the script should behave):
```
Unable to import test_selections from tools/testing. Running without test selection stats.... Reason: No module named 'tools.stats'
Traceback (most recent call last):
  File "/Users/nshulga/git/pytorch/pytorch/test/run_test.py", line 1673, in <module>
    main()
  File "/Users/nshulga/git/pytorch/pytorch/test/run_test.py", line 1604, in main
    selected_tests = get_selected_tests(options)
  File "/Users/nshulga/git/pytorch/pytorch/test/run_test.py", line 1418, in get_selected_tests
    path = os.path.join(str(REPO_ROOT), TEST_TIMES_FILE)
NameError: name 'TEST_TIMES_FILE' is not defined
```

But make sure to remove it in the end, otherwise it will not work when torch is installed from a wheel but tests are run from a clean repo checkout.
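
The gist of the fix, sketched below (illustrative only; `REPO_ROOT` and the imported module mirror the error message above, and the real change is in `test/run_test.py`):

```python
import os
import sys

REPO_ROOT = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))

# Put the repo checkout first so its `tools` package wins over any other
# installed `tools` (e.g. torchaudio's) already on sys.path
sys.path.insert(0, REPO_ROOT)
from tools.testing import test_selections  # noqa: E402

# ... use test_selections ...

# Remove it again afterwards, otherwise running tests from a clean repo checkout
# against a wheel-installed torch breaks
sys.path.remove(REPO_ROOT)
```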

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at dd52521</samp>

> _Sing, O Muse, of the cunning code review_
> _That fixed the tests of the `tools` module_
> _By adding and removing the root path_
> _As a shepherd guides his flock to and fro._
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104214
Approved by: https://github.com/kit1980
2023-06-27 15:54:34 +00:00
4d613b9a5f [doc] Improve mps package description (#104184)
Fixes #104183

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104184
Approved by: https://github.com/malfet
2023-06-27 15:50:36 +00:00
ad2905ad27 Make _test_autograd_multiple_dispatch_view a view operation (#104149)
Fixes the `test_view_copy_cuda` failure case in https://github.com/pytorch/pytorch/issues/99655

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104149
Approved by: https://github.com/soulitzer
2023-06-27 15:43:35 +00:00
567b5e5b28 Multioutput backward formula: allow conditional guards against saving (#103750)
Multi-output backward formulas break the ability of autogen to decide which variables have to be stored in a graph.
This PR introduces a macro `wrap_opt_if` which could be used to hint autogen about variable interdependence.

For example, the following code is being generated for `_trilinear` with this modification:
```
at::Tensor _trilinear(c10::DispatchKeySet ks, const at::Tensor & i1, const at::Tensor & i2, const at::Tensor & i3, at::IntArrayRef expand1, at::IntArrayRef expand2, at::IntArrayRef expand3, at::IntArrayRef sumdim, int64_t unroll_dim) {
  auto& i1_ = unpack(i1, "i1", 0);
  auto& i2_ = unpack(i2, "i2", 1);
  auto& i3_ = unpack(i3, "i3", 2);
  [[maybe_unused]] auto _any_requires_grad = compute_requires_grad( i1, i2, i3 );

  [[maybe_unused]] auto _any_has_forward_grad_result = (isFwGradDefined(i1) || isFwGradDefined(i2) || isFwGradDefined(i3));
  std::shared_ptr<TrilinearBackward0> grad_fn;
  if (_any_requires_grad) {
    grad_fn = std::shared_ptr<TrilinearBackward0>(new TrilinearBackward0(), deleteNode);
    grad_fn->set_next_edges(collect_next_edges( i1, i2, i3 ));
    grad_fn->expand1 = expand1.vec();
    grad_fn->expand2 = expand2.vec();
    grad_fn->expand3 = expand3.vec();
    if (grad_fn->should_compute_output(1) || grad_fn->should_compute_output(2)) {
      grad_fn->i1_ = SavedVariable(i1, false);
    }
    if (grad_fn->should_compute_output(0) || grad_fn->should_compute_output(2)) {
      grad_fn->i2_ = SavedVariable(i2, false);
    }
    if (grad_fn->should_compute_output(0) || grad_fn->should_compute_output(1)) {
      grad_fn->i3_ = SavedVariable(i3, false);
    }
    grad_fn->sumdim = sumdim.vec();
  }

```

with the following backward modifications:
```
 - name: _trilinear(Tensor i1, Tensor i2, Tensor i3, int[] expand1, int[] expand2, int[] expand3, int[] sumdim, int unroll_dim=1) -> Tensor
  - i1, i2, i3: _trilinear_backward(grad, i1, i2, i3, expand1, expand2, expand3, sumdim, grad_input_mask)
  + i1, i2, i3: "_trilinear_backward(grad,
  +             wrap_opt_if(i1, grad_input_mask[1] || grad_input_mask[2]),
  +             wrap_opt_if(i2, grad_input_mask[0] || grad_input_mask[2]),
  +             wrap_opt_if(i3, grad_input_mask[0] || grad_input_mask[1]),
  +             expand1, expand2, expand3, sumdim, grad_input_mask)"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103750
Approved by: https://github.com/soulitzer
2023-06-27 15:12:09 +00:00
18dacf7e79 [Specialized Kernel] Update yaml syntax to use kernel instead of dispatch (#104070)
Based on this [code search](https://fburl.com/code/gjcnw8ly) (*.yaml with `dispatch: CPU:`), update all files found to use

```
kernels:
    - arg_meta: None
      kernel_name:
```
instead of
```
dispatch:
    CPU:
```
---
## Code changes:

- `fbcode/executorch/codegen/tools/gen_oplist.py`
  - Strip ET specific fields prior to calling parse_native_yaml_struct
---
## Files edited that are not `*functions.yaml` or `custom_ops.yaml`

- fbcode/executorch/kernels/optimized/optimized.yaml
- fbcode/executorch/kernels/quantized/quantized.yaml
- fbcode/executorch/kernels/test/custom_kernel_example/my_functions.yaml

---
## Found Files that were not edited

**Dispatched to more than just CPU**
- fbcode/caffe2/aten/src/ATen/native/native_functions.yaml
- xplat/caffe2/aten/src/ATen/native/native_functions.yaml
- xros/third-party/caffe2/caffe2/aten/src/ATen/native/native_functions.yaml

**Grouped ops.yaml path**
- fbcode/on_device_ai/Assistant/Jarvis/min_runtime/operators/ops.yaml

---
**Design Doc:** https://docs.google.com/document/d/1gq4Wz2R6verKJ2EFseLyPdAF0wqomnCrVDDJpRkYsRw/edit?kh_source=GDOCS#heading=h.8raqyft9y50

Differential Revision: [D46952067](https://our.internmc.facebook.com/intern/diff/D46952067/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D46952067/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104070
Approved by: https://github.com/larryliu0820
2023-06-27 09:53:20 +00:00
95707ac964 [fake_pg] allow fake_pg allgather to do some simple validation (#104213)
Note that in general it's not good form to try to make FakePG work with 'real data',
but the reasoning here is that we want FakePG to work with DeviceMesh's init code
that has data validation, which makes it worth the tradeoff.

In general, users should use MTPG or a normal PG for cases where they care about
real data from collectives.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104213
Approved by: https://github.com/wconstab, https://github.com/voznesenskym
2023-06-27 09:39:16 +00:00
6c1ccccf21 Enable mimalloc on pytorch Windows (#102595)
This PR is the implementation of [#102534](https://github.com/pytorch/pytorch/issues/102534), option 2.
Major changes:
1. Add mimalloc as a submodule.
2. Add build option "USE_MIMALLOC".
3. It is only enabled in the Windows build, and it improves PyTorch memory allocation performance.

Additional Test:
<img width="953" alt="image" src="https://github.com/pytorch/pytorch/assets/8433590/4b2ec2dc-16f1-4ad9-b457-cfeb37e489d3">
This PR also builds and statically links mimalloc on Linux.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102595
Approved by: https://github.com/jgong5, https://github.com/malfet
2023-06-27 08:53:26 +00:00
803c14490b Specialize storage_offset - Does not cover automatic dynamic (#104204)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104204
Approved by: https://github.com/wconstab
2023-06-27 05:51:42 +00:00
c3e4a67905 Refactor multigpu tests to test_cuda_multigpu (#104059)
Mostly a refactor that moves all the tests from `test_cuda` that benefit from a multi-GPU environment into their own file.

- Add `TestCudaMallocAsync` class for Async tests ( to separate them from `TestCudaComm`)
- Move individual tests from `TestCuda` to `TestCudaMultiGPU`
- Move `_create_scaling_models_optimizers` and `_create_scaling_case` to `torch.testing._internal.common_cuda`
- Add newly created `test_cuda_multigpu` to the multigpu periodic test

<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at f4d46fa</samp>

This pull request fixes a flaky test and improves the testing of gradient scaling on multiple GPUs. It adds verbose output for two CUDA tests, and refactors some common code into helper functions in `torch/testing/_internal/common_cuda.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104059
Approved by: https://github.com/huydhn
2023-06-27 05:32:05 +00:00
572ff2779b [RESUBMIT] Ensure ncclCommAbort can abort stuck ncclCommInitRank (#103925)
https://github.com/pytorch/pytorch/pull/95715 added the functionality to abort `ncclCommInitRankConfig` by specifying `blocking=0` to enable non-blocking behavior.

However, calling `pg._abort()` didn't recover from a stuck `ncclCommInitRankConfig`, since the `_abort` method only looked through the `devNCCLCommMap_` map and aborted those communicators. Since `ncclCommInitRankConfig` was stuck, the communicator itself wasn't added to the map and the host thread was stuck on this line: https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L1171. As a result, `_abort` was a no-op.

To resolve this issue, I added the communicators to `inProgressCommMap_` as soon as they were created and then removed them once added to `devNCCLCommMap_`.

I also added a unit test that was failing without the changes to ProcessGroupNCCL.cpp
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103925
Approved by: https://github.com/osalpekar
2023-06-27 04:22:03 +00:00
b76a040b18 Revert "[core][pruning][sparse][feature] SparseSemiStructured tensor subclass (#102135)"
This reverts commit aea771de30427998e83010459b69da1ab66f0879.

Reverted https://github.com/pytorch/pytorch/pull/102135 on behalf of https://github.com/huydhn due to test_sparse_semi_structured.py::TestSparseSemiStructuredCUDA::test_mm_sparse_first_NT_cuda_int8 is still failing CUDA trunk jobs aea771de30 ([comment](https://github.com/pytorch/pytorch/pull/102135#issuecomment-1608744110))
2023-06-27 03:49:31 +00:00
7157dfdd4a [jit] fix duplicated module input and output values in tracing module (#102510)
The remap should record the original input pointers instead of the remapped ones.

Test case:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Normalize(nn.Module):
    def __init__(self):
        super().__init__()

        self.norm = nn.GroupNorm(num_groups=32, num_channels=32)

    def forward(self, x, y):
        if y is None:
            y = x
        else:
            y = self.norm(y)

        y = y * 2

        return y

class G(nn.Module):
    def __init__(self):
        super().__init__()

        self.norm = Normalize()

    def forward(self, x):

        A = self.norm(x, None)
        B = F.relu(A)

        return A, B

class Net(nn.Module):
    def __init__(self):
        super().__init__()

        self.g = G()

        self.norm_1 = Normalize()

    def forward(self, x):
        hs = self.g(x)

        A, B = hs

        h = self.norm_1(B, A)
        return h

net = Net()
net = net.eval()

x = torch.randn(1, 32, 16, 16)

traced = torch.jit.trace(net, x)

print(traced.graph)
```

Without this patch, there are duplicated lifted values: %80, %81, %82, %83, %84, %85.
```
graph(%self.1 : __torch__.Net,
      %x : Float(1, 32, 16, 16, strides=[8192, 256, 16, 1], requires_grad=0, device=cpu)):
  %norm_1 : __torch__.___torch_mangle_1.Normalize = prim::GetAttr[name="norm_1"](%self.1)
  %g : __torch__.G = prim::GetAttr[name="g"](%self.1)
  %86 : (Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor) = prim::CallMethod[name="forward"](%g, %x)
  %79 : Float(1, 32, 16, 16, strides=[8192, 256, 16, 1], requires_grad=0, device=cpu), %80 : Float(1, 32, 16, 16, strides=[8192, 256, 16, 1], requires_grad=0, device=cpu), %81 : Float(1, 32, 16, 16, strides=[8192, 256, 16, 1], requires_grad=0, device=cpu), %82 : Float(1, 32, 16, 16, strides=[8192, 256, 16, 1], requires_grad=0, device=cpu), %83 : Float(1, 32, 16, 16, strides=[8192, 256, 16, 1], requires_grad=0, device=cpu), %84 : Float(1, 32, 16, 16, strides=[8192, 256, 16, 1], requires_grad=0, device=cpu), %85 : Float(1, 32, 16, 16, strides=[8192, 256, 16, 1], requires_grad=0, device=cpu) = prim::TupleUnpack(%86)
  %87 : Tensor = prim::CallMethod[name="forward"](%norm_1, %79, %80, %81, %82, %83, %84, %85)
  return (%87)

```

With this patch:
```
graph(%self.1 : __torch__.Net,
      %x : Float(1, 32, 16, 16, strides=[8192, 256, 16, 1], requires_grad=0, device=cpu)):
  %norm_1 : __torch__.___torch_mangle_1.Normalize = prim::GetAttr[name="norm_1"](%self.1)
  %g : __torch__.G = prim::GetAttr[name="g"](%self.1)
  %71 : Tensor = prim::CallMethod[name="forward"](%g, %x)
  %72 : Tensor = prim::CallMethod[name="forward"](%norm_1, %71)
  return (%72)

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102510
Approved by: https://github.com/davidberard98
2023-06-27 03:43:06 +00:00
aea771de30 [core][pruning][sparse][feature] SparseSemiStructured tensor subclass (#102135)
This PR adds in support for semi-structured sparsity via a tensor
subclass. It currently uses the CUTLASS kernels merged in PR #100881.

In the future we plan to add in cuSPARSELt support (see the other PRs in
the stack), which will give us larger performance gains.

This PR adds in 2 things:
- a Tensor subclass, `SparseSemiStructuredTensor` to store the
  sparse tensor in compressed form and override `__torch_dispatch__`.
- a conversion function that takes in a dense tensor and a
  semi-structured sparse bool mask and creates an instance of the
  subclass.

**SparseSemiStructuredTensor**

The subclass stores the dense tensor in a contiguous flattened tensor
for future compatibility with cuSPARSELt, which expects this format.
Note that the CUTLASS kernels do not have this limitation, as the
specified values and the metadata are passed separately in
`_structured_sparse_linear`. In the future we can use the cuSPARSELt bindings
[here](https://github.com/pytorch/pytorch/pull/103700) for faster matmul, better dtype coverage, and relaxed shape
constraints.

Since we currently don't have a way to go back from the sparse
representation to the dense representation, and we store the weights in
compressed form, we don't have a great way to handle .t().

Instead, we keep track of how often we've called transpose on our
tensor, and if it's an unexpected number we throw an error. When the first
argument is sparse, we expect an even number of calls to transpose,
while when the second argument is sparse, we expect an odd number of
calls. This is because we support second argument sparse matrix
multiplications by using transpose properties.

**to_sparse_semi_structured**

This is a conversion function to convert a dense tensor and a
semi-structured sparse bool mask into a subclass. Currently, we must
pass in a bool mask, since we can't infer it because there may be
additional zero elements in the dense tensor, so `tensor != 0` is not necessarily 2:4
sparse.

Once we add either a method to derive the mask from the dense tensor or
cuSPARSELt, we no longer need to pass in the mask. cuSPARSELt has its
own helper functions to create the metadata mask.

**User Details**

We have implemented support for the following ops for `torch.float16`
and `torch.int8`:
```
torch.addmm(bias, dense, sparse.t())
torch.mm(dense, sparse)
torch.mm(sparse, dense)
aten.linear.default
aten.t.default
aten.t.detach
```

The end user interface to accelerate an nn.Linear module with the
subclass would look like this:

```
from torch.sparse import to_sparse_semi_structured

mask = torch.Tensor([0, 0, 1, 1]).tile(128, 32).cuda().bool()
linear = Model(128, 128).half().cuda()

linear.weight = nn.Parameter(to_sparse_semi_structured(linear.weight,
                                                       mask=linear.weight.bool()))

```

This also updates tests and the `torch.sparse` module docstring to
reflect these changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102135
Approved by: https://github.com/albanD
2023-06-27 02:37:00 +00:00
968b7b5e0f Initial commit of collective_utils (#101037)
Summary:
Details in T133020932
First commit of the collective utils library. Ported over from model store; removed scuba logging, error_trait, and all dependencies on modelstore.

Test Plan: In the following diffs.

Differential Revision: D45545970

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101037
Approved by: https://github.com/H-Huang
2023-06-27 02:15:16 +00:00
41866a2ead Fix missing mandatory device_type argument in autocast docstring (#97223)
Fixes #[92803](https://github.com/pytorch/pytorch/issues/92803)
![Screenshot from 2023-03-21 12-28-14](https://user-images.githubusercontent.com/100136654/226538769-141f3b9e-0de2-4e86-8e42-d5a4a7413c6f.png)
![Screenshot from 2023-03-21 12-28-29](https://user-images.githubusercontent.com/100136654/226538777-9e719090-75c0-46f7-8594-5efcb0a46df6.png)
![Screenshot from 2023-03-21 12-29-36](https://user-images.githubusercontent.com/100136654/226538783-399a0e60-ffc9-4d73-801c-8cfce366d142.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97223
Approved by: https://github.com/albanD, https://github.com/malfet
2023-06-27 01:54:54 +00:00
6d2da6106d Raise AttributeError in _OpsNamespace if __self__ attribute is requested (#104096)
Summary:
Trying to get the `__self__` attribute on any `_OpNamespace` object should be an invalid operation. The `__self__` attribute only exists on instance method objects and not on class objects.

In [dynamo](a152b3e3b8/torch/_dynamo/variables/torch.py (L164)) there is code that tries to access the `__self__` attribute on `TorchVariable`; this currently results in an expensive call to `torch._C._jit_get_operation` [here](a152b3e3b8/torch/_ops.py (L740)) which ultimately fails and throws an exception. For cases where it fails, the operation turns out to be quite expensive, on the order of ~0.03s.

For edge use cases, when exporting large models with quantized ops this exception is thrown hundreds of times, wasting a lot of time. By preventing the call to `torch._C._jit_get_operation` we can return quickly from this function and significantly reduce export times. On a large ASR model, for example, export currently takes **~405** seconds; with this change we can reduce it to **~340s**.

Overall this should also be a harmless change, as virtually no one should ever try to access the `__self__` attribute on an `_OpNamespace` object.
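
A sketch of the kind of guard described above (illustrative only; the real change lives in `torch/_ops.py` and may differ in detail):

```python
class _OpNamespace:  # simplified stand-in for torch._ops._OpNamespace
    def __init__(self, name):
        self.name = name

    def __getattr__(self, op_name):
        # Bail out early for attributes that can never be ops; in particular
        # `__self__` would otherwise trigger an expensive (and ultimately
        # failing) operator lookup.
        if op_name == "__self__":
            raise AttributeError(
                f"Invalid attribute '{op_name}' for '_OpNamespace' '{self.name}'"
            )
        # ... fall through to the (expensive) operator lookup ...
        raise AttributeError(f"'_OpNamespace' object has no attribute '{op_name}'")
```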

Test Plan: Added test case.

Differential Revision: D46959879

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104096
Approved by: https://github.com/larryliu0820, https://github.com/ezyang, https://github.com/zou3519
2023-06-27 01:42:06 +00:00
f8ac569365 [Inductor][Quant]Fix tile2d code generation issue with uint8 data type (#104074)
**Summary**
The previous vectorized tile2d code generation doesn't support the uint8 input data type: it still treats the input as float and generates wrong results. This PR fixes this issue. Take the UT `test_tile2d_load_decomposed_dequant_add_relu_quant` in this PR as an example.
The previously generated code is:
```
#pragma GCC ivdep
for(long i1=static_cast<long>(0L); i1<static_cast<long>(192L); i1+=static_cast<long>(16L))
{
    unsigned char tmp0[16*16] __attribute__ ((aligned (16)));
    at::vec::transpose_mxn<unsigned char,16,16>(in_ptr0 + static_cast<long>(i0 + (1024L*i1)), static_cast<long>(1024L), tmp0, 16);
    unsigned char tmp7[16*16] __attribute__ ((aligned (16)));
    at::vec::transpose_mxn<unsigned char,16,16>(in_ptr1 + static_cast<long>(i0 + (1024L*i1)), static_cast<long>(1024L), tmp7, 16);
    for (long i0_inner = 0; i0_inner < 16; i0_inner++)
    {
        auto tmp1 = at::vec::Vectorized<float>::loadu(tmp0 + static_cast<long>(16L*i0_inner));
        auto tmp8 = at::vec::Vectorized<float>::loadu(tmp7 + static_cast<long>(16L*i0_inner));
        auto tmp2 = (tmp1);
        auto tmp3 = at::vec::Vectorized<float>(static_cast<float>(1.0));
        auto tmp4 = tmp2 - tmp3;
        auto tmp5 = at::vec::Vectorized<float>(static_cast<float>(0.01));
        auto tmp6 = tmp4 * tmp5;
        auto tmp9 = (tmp8);
        auto tmp10 = at::vec::Vectorized<float>(static_cast<float>(2.0));
        auto tmp11 = tmp9 - tmp10;
        auto tmp12 = at::vec::Vectorized<float>(static_cast<float>(0.02));
        auto tmp13 = tmp11 * tmp12;
        auto tmp14 = tmp6 + tmp13;
        auto tmp15 = at::vec::clamp_min(tmp14, decltype(tmp14)(0));
        auto tmp16 = at::vec::Vectorized<float>(static_cast<float>(33.333333333333336));
        auto tmp17 = tmp15 * tmp16;
        auto tmp18 = tmp17.round();
        auto tmp19 = at::vec::Vectorized<float>(static_cast<float>(3.0));
        auto tmp20 = tmp18 + tmp19;
        auto tmp21 = at::vec::Vectorized<float>(static_cast<float>(0.0));
        auto tmp22 = at::vec::maximum(tmp20, tmp21);
        auto tmp23 = at::vec::Vectorized<float>(static_cast<float>(255.0));
        auto tmp24 = at::vec::minimum(tmp22, tmp23);
        auto tmp25 = (tmp24);
        at::vec::store_float_as_uint8(tmp25, out_ptr0 + static_cast<long>(i1 + (196L*i0) + (196L*i0_inner)));
    }
}
```

After this PR, the generated code is:
```
#pragma GCC ivdep
for(long i1=static_cast<long>(0L); i1<static_cast<long>(192L); i1+=static_cast<long>(16L))
{
    unsigned char tmp0[16*16] __attribute__ ((aligned (16)));
    at::vec::transpose_mxn<unsigned char,16,16>(in_ptr0 + static_cast<long>(i0 + (1024L*i1)), static_cast<long>(1024L), tmp0, 16);
    unsigned char tmp7[16*16] __attribute__ ((aligned (16)));
    at::vec::transpose_mxn<unsigned char,16,16>(in_ptr1 + static_cast<long>(i0 + (1024L*i1)), static_cast<long>(1024L), tmp7, 16);
    for (long i0_inner = 0; i0_inner < 16; i0_inner++)
    {
        auto tmp1 = at::vec::load_uint8_as_float(tmp0 + static_cast<long>(16L*i0_inner));
        auto tmp8 = at::vec::load_uint8_as_float(tmp7 + static_cast<long>(16L*i0_inner));
        auto tmp2 = (tmp1);
        auto tmp3 = at::vec::Vectorized<float>(static_cast<float>(1.0));
        auto tmp4 = tmp2 - tmp3;
        auto tmp5 = at::vec::Vectorized<float>(static_cast<float>(0.01));
        auto tmp6 = tmp4 * tmp5;
        auto tmp9 = (tmp8);
        auto tmp10 = at::vec::Vectorized<float>(static_cast<float>(2.0));
        auto tmp11 = tmp9 - tmp10;
        auto tmp12 = at::vec::Vectorized<float>(static_cast<float>(0.02));
        auto tmp13 = tmp11 * tmp12;
        auto tmp14 = tmp6 + tmp13;
        auto tmp15 = at::vec::clamp_min(tmp14, decltype(tmp14)(0));
        auto tmp16 = at::vec::Vectorized<float>(static_cast<float>(33.333333333333336));
        auto tmp17 = tmp15 * tmp16;
        auto tmp18 = tmp17.round();
        auto tmp19 = at::vec::Vectorized<float>(static_cast<float>(3.0));
        auto tmp20 = tmp18 + tmp19;
        auto tmp21 = at::vec::Vectorized<float>(static_cast<float>(0.0));
        auto tmp22 = at::vec::maximum(tmp20, tmp21);
        auto tmp23 = at::vec::Vectorized<float>(static_cast<float>(255.0));
        auto tmp24 = at::vec::minimum(tmp22, tmp23);
        auto tmp25 = (tmp24);
        at::vec::store_float_as_uint8(tmp25, out_ptr0 + static_cast<long>(i1 + (196L*i0) + (196L*i0_inner)));
    }
}
```

**Test Plan**
```
python -m pytest test_cpu_repro.py -k test_tile2d_load_decomposed_dequant_add_relu_quant
python -m pytest test_cpu_repro.py -k test_tile2d_store_channel_shuffle_cl_quant_output
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104074
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-06-27 00:59:05 +00:00
d2281e38ae Adds the initial support for AOTInductor model and interface (#104202)
This PR combines the C++ code for the AOTInductor's model and interface with Bin Bao's changes to AOTInductor codegen.

It adds a number of AOTInductor C interfaces that can be used by an inference runtime. Under the hood of the interfaces, the model code generated by the AOTInductor codegen is wrapped into a class, AOTInductorModel, which manages tensors and runs the model inference.

On top of AOTInductorModel, we provide one more abstract layer, AOTInductorModelContainer, which allows the user to have multiple inference runs concurrently for the same model.

This PR also adjusts the compilation options for AOT codegen, particularly some fbcode-related changes such as libs to be linked and header-file search paths.

Note that this is the very first version of the AOTInductor model and interface, so many features (e.g. dynamic shapes) are incomplete. We will support those missing features in future PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104202
Approved by: https://github.com/desertfire
2023-06-27 00:37:26 +00:00
d8a2e7461b Fix incorrect distribution of randperm with device mps (#104171)
Fixes #104170

As noted in the above issue it seems that the code for randperm basically boils down to:
`torch.argsort(torch.rand(size, device="mps"), dim = 0)`

However, it seems that in the fused(?) PyTorch version, the tensor we were drawing `torch.rand(size, device="mps")` from was int64 with an inclusive(?) upper bound of 1. This caused everything to be sorted into two groups (depending on whether you drew 0 or 1), each monotonically ascending due to sort tie-breaking.

One way to fix this is to just generate the random tensor as float64s with an upper bound of 1.0 instead of int64s. An alternative is to just set the upper bound to int64 max.

~I chose the float64 one basically on a coin flip b/c I couldn't tell the original contributor's intent (due to the mixed-up upper bound and type), but would be happy to change to use int64 with int64 max as the upper bound instead if that's better.~

Edit: on second thought, I don't like using floats from 0.0 to 1.0, as there are fewer of them in that range than int64s from 0 to int64 max. I also suspect integer math might be faster, but I need to benchmark this tomorrow.
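
For reference, a quick sanity check of the behavior described above (a sketch, assuming an MPS device is available):

```python
import torch

n, trials = 8, 2000
counts = torch.zeros(n, n)  # counts[i, v]: how often value v lands at position i
for _ in range(trials):
    perm = torch.randperm(n, device="mps").cpu()
    counts[torch.arange(n), perm] += 1

# With a correct implementation every entry hovers around trials / n;
# the buggy version concentrates values into two monotonically ascending runs.
print(counts / trials)
```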
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104171
Approved by: https://github.com/malfet
2023-06-27 00:36:15 +00:00
994b98b78b Add language server support for vscode (#104160)
Makes it so clangd support can work with VS Code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104160
Approved by: https://github.com/seemethere
2023-06-27 00:20:53 +00:00
981f24e806 Add docstring to torch.serialization.register_package (#104046)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104046
Approved by: https://github.com/albanD
2023-06-26 23:28:32 +00:00
4a008d268a REDO of dropout support for mem eff #102038 (#103704)
THIS IS A new PR with the changes from #102038 + #103201, plus namespacing changes to fix a bug.

# Summary
This PR builds off of:
- https://github.com/pytorch/pytorch/pull/101847
- https://github.com/pytorch/pytorch/pull/100583

It specifically adds dropout support to the memory efficient attention kernel. In the process of doing so roughly 3 changes were made:
- Update sdpa dispatching to allow for inputs requiring grad to be sent to efficient attention
- Update how memory efficient attention handles passing the rng state from forward to backward in order to enable cuda_graph support
- Fix a bug in the kernel that was causing incorrect gradients to be produced for num_keys > 64 with dropout and causal masking set. https://github.com/facebookresearch/xformers/pull/755
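
A minimal way to exercise the new path (a sketch; it assumes a CUDA build and uses the `torch.backends.cuda.sdp_kernel` context manager to force the memory-efficient backend, with illustrative shapes and dropout probability):

```python
import torch
import torch.nn.functional as F

q, k, v = (torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16,
                       requires_grad=True)
           for _ in range(3))

# Force the memory-efficient kernel; dropout_p > 0 with grads now dispatches here
with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_math=False,
                                    enable_mem_efficient=True):
    out = F.scaled_dot_product_attention(q, k, v, dropout_p=0.1, is_causal=True)
out.sum().backward()
```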

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103704
Approved by: https://github.com/cpuhrsch
2023-06-26 23:05:03 +00:00
bfa08a1c67 Revert "[core][pruning][sparse][feature] SparseSemiStructured tensor subclass (#102135)"
This reverts commit cf5262a84f815c1e574883bc244333d0d211c7a2.

Reverted https://github.com/pytorch/pytorch/pull/102135 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but test_sparse_semi_structured.py::TestSparseSemiStructuredCUDA::test_mm_sparse_first_NT_cuda_int8 is failing CUDA trunk jobs cf5262a84f. This looks like a landrace ([comment](https://github.com/pytorch/pytorch/pull/102135#issuecomment-1608423849))
2023-06-26 22:54:16 +00:00
cyy
d4a98280a8 [Reland] Use missing-prototypes in torch_cpu (#104138)
This PR enables Wmissing-prototypes in torch_cpu except some generated cpp files and the mps and metal,vulkan backends and caffe2 sources.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104138
Approved by: https://github.com/albanD, https://github.com/malfet
2023-06-26 22:53:43 +00:00
436d035dc7 Revert "DDP + C10D sparse all_reduce changes (#103916)"
This reverts commit fed5fba6e4ee3f221bac481798c5a31f785ba75e.

Reverted https://github.com/pytorch/pytorch/pull/103916 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/103916#issuecomment-1608412325))
2023-06-26 22:37:58 +00:00
a69f427f95 aten: Ensure dim is size_t (#104201)
Attempts to fix failures introduced in https://github.com/pytorch/pytorch/pull/103930 (example failures: https://github.com/pytorch/pytorch/actions/runs/5363450214/jobs/9731034104)

<!--
copilot:all
-->
### <samp>🤖 Generated by Copilot at 67d5076</samp>

### Summary
🔧🚨🚦

<!--
1.  🔧 (wrench) - This emoji can be used to indicate a bug fix or a minor improvement to the code quality or performance.
2.  🚨 (rotating light) - This emoji can be used to indicate a change that affects the error handling or validation logic of the code, or that adds or modifies a test case.
3.  🚦 (vertical traffic light) - This emoji can be used to indicate a change that affects the control flow or branching logic of the code, or that adds or modifies a condition or assertion.
-->
Fix a compiler warning in `Expand.cpp` by casting a tensor dimension to `size_t`. This improves the code quality and correctness of the `expand` function for the Vulkan backend.

> _`expand` tensor_
> _cast `dim()` to `size_t`_
> _autumn leaves warning_

### Walkthrough
*  Cast `self.dim()` to `size_t` to avoid signed-unsigned comparison warning in `expand` function ([link](https://github.com/pytorch/pytorch/pull/104201/files?diff=unified&w=0#diff-c175e908cbcb8595b22696e672b526202ed3a4a11341603c1522397e499b5c2bL29-R29))

<details>
<summary> Fix done using chatgpt </summary>

![Screenshot 2023-06-26 at 11 52 14 AM](https://github.com/pytorch/pytorch/assets/1700823/95c141e5-36b6-4916-85ca-85415bcc507f)

</details>
Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104201
Approved by: https://github.com/lucylq, https://github.com/huydhn, https://github.com/malfet
2023-06-26 22:01:27 +00:00
b93ed8164e Add non-recursive module.to_empty option (#104197)
Fixes https://github.com/pytorch/pytorch/issues/97049, related to https://github.com/pytorch/pytorch/issues/104187
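
A sketch of how the non-recursive option might be used (assuming the new behavior is exposed as a `recurse` keyword argument, per the PR title; the module here is illustrative only):

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.scale = nn.Parameter(torch.empty(4, device="meta"))  # direct parameter
        self.proj = nn.Linear(4, 4, device="meta")                # child module

block = Block()
# Materialize only `block`'s own parameters/buffers; `block.proj` stays on meta
# and can be initialized separately.
block.to_empty(device="cpu", recurse=False)
print(block.scale.device, block.proj.weight.device)  # cpu meta
```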

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104197
Approved by: https://github.com/albanD
2023-06-26 21:47:22 +00:00
cf5262a84f [core][pruning][sparse][feature] SparseSemiStructured tensor subclass (#102135)
This PR adds in support for semi-structured sparsity via a tensor
subclass. It currently uses the CUTLASS kernels merged in PR #100881.

In the future we plan to add in cuSPARSELt support (see the other PRs in
the stack), which will give us larger performance gains.

This PR adds in 2 things:
- a Tensor subclass, `SparseSemiStructuredTensor` to store the
  sparse tensor in compressed form and override `__torch_dispatch__`.
- a conversion function that takes in a dense tensor and a
  semi-structured sparse bool mask and creates an instance of the
  subclass.

**SparseSemiStructuredTensor**

The subclass stores the dense tensor in a contiguous flattened tensor
for future compatibility with cuSPARSELt, which expects this format.
Note that the CUTLASS kernels do not have this limitation, as the
specified values and the metadata are passed separately in
`_structured_sparse_linear`. In the future we can use the cuSPARSELt bindings
[here](https://github.com/pytorch/pytorch/pull/103700) for faster matmul, better dtype coverage, and relaxed shape
constraints.

Since we currently don't have a way to go back from the sparse
representation to the dense representation, and we store the weights in
compressed form, we don't have a great way to handle .t().

Instead, we keep track of how often we've called transpose on our
tensor, and if it's an unexpected number we throw an error. When the first
argument is sparse, we expect an even number of calls to transpose,
while when the second argument is sparse, we expect an odd number of
calls. This is because we support second argument sparse matrix
multiplications by using transpose properties.

**to_sparse_semi_structured**

This is a conversion function to convert a dense tensor and a
semi-structured sparse bool mask into a subclass. Currently, we must
pass in a bool mask, since we can't infer it because there may be
additional zero elements in the dense tensor, so `tensor != 0` is not necessarily 2:4
sparse.

Once we add either a method to derive the mask from the dense tensor or
cuSPARSELt, we no longer need to pass in the mask. cuSPARSELt has its
own helper functions to create the metadata mask.

**User Details**

We have implemented support for the following ops for `torch.float16`
and `torch.int8`:
```
torch.addmm(bias, dense, sparse.t())
torch.mm(dense, sparse)
torch.mm(sparse, dense)
aten.linear.default
aten.t.default
aten.t.detach
```

The end user interface to accelerate an nn.Linear module with the
subclass would look like this:

```
from torch.sparse import to_sparse_semi_structured

mask = torch.Tensor([0, 0, 1, 1]).tile(128, 32).cuda().bool()
linear = Model(128, 128).half().cuda()

linear.weight = nn.Parameter(to_sparse_semi_structured(linear.weight,
                                                       mask=linear.weight.bool()))

```

This also updates tests and the `torch.sparse` module docstring to
reflect these changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102135
Approved by: https://github.com/albanD
2023-06-26 21:30:43 +00:00
f7f415eb2d [inductor] add cpp randint implementation to ir.py (#103079) (#104124)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104124
Approved by: https://github.com/desertfire
2023-06-26 21:26:25 +00:00
fed5fba6e4 DDP + C10D sparse all_reduce changes (#103916)
Summary:
## Changes

Prototyping sparse allreduce using the sparse dispatch key. When passing sparse tensors into `dist.all_reduce()` we can execute our dispatched function.

Prior to this change, passing a sparse tensor into `all_reduce()` would error out with `Tensor must be dense...`

## Example script

```python
# python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 this_script.py

import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    a = torch.tensor([[0, 2.], [3, 0]]).to(rank)
    a = a.to_sparse()
    print(f"rank {rank} - a: {a}")
    dist.all_reduce(a)

if __name__ == "__main__":
    main()
```

output:
```
rank 1 - a: tensor(indices=tensor([[0, 1],
                       [1, 0]]),
       values=tensor([2., 3.]),
       device='cuda:1', size=(2, 2), nnz=2, layout=torch.sparse_coo)
allreduce_sparse_cuda_
tensor.is_sparse() = 1
in ProcessGroupNCCL::allreduceSparse
rank 0 - a: tensor(indices=tensor([[0, 1],
                       [1, 0]]),
       values=tensor([2., 3.]),
       device='cuda:0', size=(2, 2), nnz=2, layout=torch.sparse_coo)
allreduce_sparse_cuda_
tensor.is_sparse() = 1
in ProcessGroupNCCL::allreduceSparse
```

Test Plan:
Testing commands (OSS):

```
# python
pytest test/distributed/test_c10d_nccl.py -vsk test_sparse_allreduce_ops

# c++
build/bin/ProcessGroupNCCLTest --gtest_filter=ProcessGroupNCCLTest.testSparseAllreduce
```

Testing commands (internal, ondemand GPU):
ddp tests:
```
buck build mode/opt -c hpc_comms.use_nccl=exp //caffe2/test/distributed:c10d --show-full-output

# Get the .par file from the previous command and use it below
TORCH_SHOW_CPP_STACKTRACE=1 /data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/c8344b52091f4f7f/caffe2/test/distributed/__c10d__/c10d.par -r test_ddp_set_sparse_metadata
```

c10d tests:
```
# build tests and run with log output (python)
buck build mode/opt -c hpc_comms.use_nccl=exp //caffe2/test/distributed:c10d --show-full-output
NCCL_DEBUG=WARN /data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/c8344b52091f4f7f/caffe2/test/distributed/__c10d__/c10d.par -r test_sparse_allreduce_ops

# python
NCCL_DEBUG=WARN buck test mode/opt -c hpc_comms.use_nccl=exp //caffe2/test/distributed:c10d -- --exact 'caffe2/test/distributed:c10d - test_sparse_allreduce_ops (test_c10d_nccl.ProcessGroupNCCLTest)'

# c++
NCCL_DEBUG=WARN buck run mode/opt -c hpc_comms.use_nccl=exp //caffe2/test/cpp/c10d:ProcessGroupNCCLTest -- --gtest_filter=ProcessGroupNCCLTest.testSparseAllreduce
```

Differential Revision: D46724856

Pulled By: H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103916
Approved by: https://github.com/rohan-varma
2023-06-26 20:42:17 +00:00
8a08733218 update test_higher_order_op: grad test (#104179)
With https://github.com/pytorch/pytorch/pull/103597, `config.dynamic_shapes` is always `True` and we never check the generated graph.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104179
Approved by: https://github.com/zou3519
2023-06-26 19:32:59 +00:00
adf9595c2f Update CODEOWNERS (#103934)
Remove users that no longer have write access to the repo, resolving CODEOWNERS errors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103934
Approved by: https://github.com/ZainRizvi, https://github.com/atalman, https://github.com/malfet
2023-06-26 19:29:29 +00:00
fb8aa721e2 [Pytorch Edge][BE] Delete Sparse Qnnpack test failing since 2022 jul (#104073)
Summary:
According to https://www.internalfb.com/omh/view/ai_infra_mobile_platform/tests these have been failing since July 2022.

Just going to delete them unless someone thinks they actually do matter and should be made green.

https://www.internalfb.com/intern/test/562949996115570/ <- failing test

I ran locally and got errors like

  xplat/caffe2/aten/src/ATen/native/quantized/cpu/qnnpack/test/gemm-block-sparse-microkernel-tester.h:483: Failure
  Expected equality of these values:
  c[mIndex * cStride() + nIndex]
    Which is: -872.50446
  acc[mIndex * n() + nIndex]
    Which is: -872.50488
  at 0, 0: reference = -872.5048828125, optimized = -872.50445556640625, Mr x Nr = 8 x 4, M x N x K = 7 x 1 x 13
  xplat/caffe2/aten/src/ATen/native/quantized/cpu/qnnpack/test/gemm-block-sparse-microkernel-tester.h:483: Failure
  Expected equality of these values:
  c[mIndex * cStride() + nIndex]
    Which is: -67.246628
  acc[mIndex * n() + nIndex]
    Which is: -67.24707
  at 3, 0: reference = -67.2470703125, optimized = -67.246627807617188, Mr x Nr = 8 x 4, M x N x K = 4 x 1 x 15
  [  FAILED  ] Q8GEMM_8x4c1x4__SSE2.packedA_k_gt_8_subtile (148 ms)

Test Plan: ci

Reviewed By: kimishpatel

Differential Revision: D46950966

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104073
Approved by: https://github.com/kimishpatel
2023-06-26 18:27:20 +00:00
100aff9d4f [export] Deserialize subgraphs. (#103991)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103991
Approved by: https://github.com/angelayi, https://github.com/avikchaudhuri
2023-06-26 18:17:44 +00:00