Commit Graph

61538 Commits

Author SHA1 Message Date
1a8af1503f Upgrade Pybind submodule to 2.10.4 (#103989)
This is not ready for review, this is to make sure asan is fixed.
Not sure what is the most effective way to track down the bad dec_ref within deploy yet.

The asan silencing is done to match this comment:
1c79003b3c/test/test_cpp_extensions_jit.py (L749-L752)

EDIT: since the final failing function is in libtorch_python.so, we would need to skip that whole lib (not ok). So now we're skipping based on the function name which should be restrictive enough to not hide any real bug.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103989
Approved by: https://github.com/malfet
2023-06-27 20:22:39 +00:00
c98896b76f [quant][pt2e] Add more precise representation for quantized add (#104130)
Summary:
The planned e2e for quantization in pytorch 2.0 export is the following:

float_model -> prepare_pt2e -> calibration -> convert_pt2e -> ...

inside convert_pt2e, we will first produce a q/dq representation of the quantized model, similar to the previous output of
convert_to_reference_fx in fx graph mode quantization:

```
torch.ops.quantized_decomposed.dequantize_per_tensor -> torch.ops.aten.add -> torch.ops.quantized_decomposed.quantize_per_tensor
torch.ops.quantized_decomposed.dequantize_per_tensor   /
```

Then we'll rewrite the above to a representation that expresses the intent more precisely: here we actually
want to do int8 addition rather than simulate it with fp32 operations. The representation for
quantized add is:

```
def quantized_add(x_i8, x_scale, x_zero_point, y_i8, y_scale, y_zero_point, out_scale, out_zero_point):
    x = (x_scale / out_scale) * x_i8
    y = (y_scale / out_scale) * y_i8
    out = x + y
    out -= (x_zero_point * x_scale + y_zero_point * y_scale) / out_scale
    out += out_zero_point
    return out
```
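
For reference, the q/dq chain above computes the following (a minimal sketch in plain PyTorch using the standard affine quantization scheme; the helper names are illustrative and the decomposed ops are not called directly):

```python
import torch

def dequant(q_i8, scale, zero_point):
    # standard affine dequantization: (q - zp) * scale
    return (q_i8.to(torch.float32) - zero_point) * scale

def quant(x_fp32, scale, zero_point, qmin=-128, qmax=127):
    # standard affine quantization: round, shift by zero point, clamp to int8 range
    q = torch.round(x_fp32 / scale) + zero_point
    return torch.clamp(q, qmin, qmax).to(torch.int8)

def reference_quantized_add(x_i8, x_scale, x_zp, y_i8, y_scale, y_zp, out_scale, out_zp):
    # what dequantize_per_tensor -> aten.add -> quantize_per_tensor computes
    out_fp32 = dequant(x_i8, x_scale, x_zp) + dequant(y_i8, y_scale, y_zp)
    return quant(out_fp32, out_scale, out_zp)
```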

Test Plan:
```
buck2 test caffe2/test:quantization_pt2e -- --exact 'caffe2/test:quantization_pt2e - test_representation_add (quantization.pt2e.test_quantize_pt2e.TestQuantizePT2E)'
```

Reviewed By: kimishpatel

Differential Revision: D45628032

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104130
Approved by: https://github.com/kimishpatel
2023-06-27 20:11:30 +00:00
7bf27cf163 [Inductor][FX passes] Remove config.split_cat_fx_passes & Add config.experimental_patterns (#104208)
Summary:
TLDR:
* Remove config.split_cat_fx_passes, and move the split/cat passes behind config.pattern_matcher (True by default).
* Add config.experimental_patterns (False by default).
* In the future, general/universal patterns should go behind config.pattern_matcher; customized/immature patterns should go behind config.experimental_patterns.

More details at:
https://docs.google.com/document/d/1P8uJTpOTdQpUbw56UxHol40tt-EPFTq1Qu38072E9aM/edit
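
As a rough illustration of how the resulting flags would be toggled (a sketch; the flag names follow the TLDR above and the final API may differ):

```python
import torch._inductor.config as inductor_config

# general/universal fusion patterns (split/cat passes now live here); True by default
inductor_config.pattern_matcher = True
# customized / not-yet-mature patterns stay opt-in; False by default
inductor_config.experimental_patterns = False
```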

Test Plan: Existing unit tests

Reviewed By: jansel, jackiexu1992

Differential Revision: D46752606

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104208
Approved by: https://github.com/williamwen42
2023-06-27 20:08:40 +00:00
2da6cae43c [core][pruning][sparse][feature] SparseSemiStructured tensor subclass (#102135)
This PR adds in support for semi-structured sparsity via a tensor
subclass. It currently uses the CUTLASS kernels merged in PR #100881.

In the future we plan to add in cuSPARSELt support (see the other PRs in
the stack), which will give us larger performance gains.

This PR adds in 2 things:
- a Tensor subclass, `SparseSemiStructuredTensor` to store the
  sparse tensor in compressed form and override `__torch_dispatch__`.
- a conversion function that takes in a dense tensor and a
  semi-structured sparse bool mask and creates an instance of the
  subclass.

**SparseSemiStructuredTensor**

The subclass stores the dense tensor in a contiguous flattened tensor
for future compatibility with cuSPARSELt, which expects this format.
Note that the CUTLASS kernels do not have this limitation, as the
specified values and the metadata are passed separately in
`_structured_sparse_linear`. In the future we can use the cuSPARSELt bindings
[here](https://github.com/pytorch/pytorch/pull/103700) for faster matmul, better dtype coverage, and relaxed shape
constraints.

Since we currently don't have a way to go back from the sparse
representation to the dense representation, and we store the weights in
compressed form, we don't have a great way to handle .t().

Instead, we keep track of how often we've called transpose on our
tensor, and if it's an unexpected number we throw an error. When the first
argument is sparse, we expect an even number of calls to transpose,
while when the second argument is sparse, we expect an odd number of
calls. This is because we support second argument sparse matrix
multiplications by using transpose properties.
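
The transpose trick relied on here is the identity `(A @ B) == (B.t() @ A.t()).t()`, which lets a kernel that only accepts a sparse first operand also serve the sparse-second-operand case. A quick dense sanity check of that identity (illustrative only, not the subclass code):

```python
import torch

A = torch.randn(8, 16)
B = torch.randn(16, 4)
# mm with a "sparse" second argument can be rewritten so that operand appears
# first, at the cost of two extra transposes
assert torch.allclose(A @ B, (B.t() @ A.t()).t(), atol=1e-6)
```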

**to_sparse_semi_structured**

This is a conversion function to convert a dense tensor and a
semi-structured sparse bool mask into a subclass. Currently, we must
pass in a bool mask, since we can't infer it because there may be
additional zero elements in the dense tensor, so `tensor != 0` is not necessarily 2:4
sparse.
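
For context, 2:4 (semi-structured) sparsity means each contiguous group of four elements along the last dimension has at most two non-zeros, and a dense tensor may contain extra zeros, which is why `tensor != 0` cannot be trusted. A small illustrative check (a sketch, not part of this PR):

```python
import torch

def is_24_sparse(mask: torch.Tensor) -> bool:
    # mask: bool tensor whose last dim is a multiple of 4; True where a value is kept
    groups = mask.reshape(-1, 4)
    return bool((groups.sum(dim=-1) <= 2).all())

mask = torch.tensor([0, 0, 1, 1]).bool().tile(128, 32)
print(is_24_sparse(mask))  # True: exactly 2 non-zeros in every group of 4
```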

Once we add either a method to derive the mask from the dense tensor or
cuSPARSELt, we no longer need to pass in the mask. cuSPARSELt has its
own helper functions to create the metadata mask.

**User Details**

We have implemented support for the following ops for `torch.float16`
and `torch.int8`:
```
torch.addmm(bias, dense, sparse.t())
torch.mm(dense, sparse)
torch.mm(sparse, dense)
aten.linear.default
aten.t.default
aten.t.detach
```

The end user interface to accelerate an nn.Linear module with the
subclass would look like this:

```
from torch.sparse import to_sparse_semi_structured

mask = torch.Tensor([0, 0, 1, 1]).tile(128, 32).cuda().bool()
linear = Model(128, 128).half().cuda()

linear.weight = nn.Parameter(to_sparse_semi_structured(linear.weight,
                                                       mask=linear.weight.bool()))

```

This also updates tests and the `torch.sparse` module docstring to
reflect these changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102135
Approved by: https://github.com/albanD
2023-06-27 19:21:06 +00:00
39868b0578 [codemod][third-party][gtest] Migrate all fbcode gtest from tp2 to fbsource/third-party (#104255)
Summary:
## What is this?
This is a giant codemod to migrate all of fbcode from the tp2 version of gtest to the `fbsource/third-party` version.

## Why?
Various parts of the monorepo use different versions of gtest which are incompatible with each other and make maintenance of C++ testing more difficult than it should be. There also doesn't seem to be much reason for this fragmentation. Shifting all `gtest` dependencies towards `fbsource/third-party` is a big step in the right direction towards cleaning this up.

Also -- tp2 is deprecated, so we want to stop using that anyway. If we're going to make improvements to `gtest`, we should get away from tp2 as a first step.

## How?

I used bash script to perform the majority of the codemod: P777150295

I followed up with `rg` to find additional dependencies, then simply iterated a ton until CI was (mostly) happy.

This diff also includes an update to autodeps to use the `third-party/fbsource` version of gtest rather than the `tp2` version.

#forcetdhashing

Test Plan: CI

Differential Revision: D46961576

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104255
Approved by: https://github.com/huydhn
2023-06-27 19:10:08 +00:00
a66107a30c [DTensor][Random] Introduce CudaRNGStateTracker to maintain parallel RNG state for DTensor (#103235)
# Change
This PR adds two classes to DTensor:

1. `CudaRNGStateTracker`:  `CudaRNGStateTracker` stores Random Number Generator (RNG) state (a `ByteTensor` object) in a `dict`, mapping from a corresponding tag to each state tensor. It also provides a set of convenient utility methods to help access/modify the state tensors. The most important interface is `_distribute_region` which will be used when DTensor executes a random op (an operator that calls RNG).

2. `OffsetBasedRNGTracker`: This subclass of `CudaRNGStateTracker` defines the default policy of how RNG states should be shared and synchronized among all ranks to respect the semantics of DTensor random operators.
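
A rough sketch of the tracker idea described above (method names and details here are illustrative; the real implementation lives in DTensor and differs):

```python
import contextlib
import torch

class CudaRNGStateTracker:
    """Keeps per-tag CUDA RNG states (ByteTensors) and swaps them in on demand."""

    def __init__(self):
        self._states = {}  # tag -> ByteTensor RNG state

    def add_state(self, tag: str, seed: int):
        orig = torch.cuda.get_rng_state()
        torch.cuda.manual_seed(seed)
        self._states[tag] = torch.cuda.get_rng_state()
        torch.cuda.set_rng_state(orig)

    @contextlib.contextmanager
    def _distribute_region(self, tag: str):
        # run the enclosed random op under the tracked state, then save it back
        orig = torch.cuda.get_rng_state()
        torch.cuda.set_rng_state(self._states[tag])
        try:
            yield
        finally:
            self._states[tag] = torch.cuda.get_rng_state()
            torch.cuda.set_rng_state(orig)
```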

# Warning

- With `Multi-threaded ProcessGroup`, the global variable `_rng_tracker` will be shared among threads (ranks) and cause issues. We need to figure out a compatible solution for that.

- The RNG state may be out of sync on ranks outside the participating ranks. This is harmless in our current submesh use case, though.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103235
Approved by: https://github.com/wanchaol
2023-06-27 19:00:25 +00:00
84f578dcc2 [ONNX] Cache AutoTokenizer in CI for test (#104233)
Fixes #103950

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104233
Approved by: https://github.com/malfet
2023-06-27 18:55:39 +00:00
93b6b17dd0 CUDA_HOST_COMPILER spelling fix in cmake build files generate method (#104126)
Fixes the CUDA_HOST_COMPILER spelling when generating additional build options in the CMake.generate method.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104126
Approved by: https://github.com/malfet
2023-06-27 18:46:12 +00:00
a73ad82c8f conditional CMAKE_CUDA_STANDARD (#104240)
Fixes #104237

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104240
Approved by: https://github.com/malfet
2023-06-27 18:41:25 +00:00
bf34ecd0c8 [RFC]: Integrate assertions functionalization to export (after AOT export) (#103887)
This PR integrates the assertion functionalization logic into the current export logic.

**NOTE:**
I finally decided to do the assertion functionalization after AOT export instead of before for the following reasons:
* The benefit of AOT export is that the graph is already functionalized, so things like method calls are already transformed into function calls. However, if we do it before AOT export, the graph is still at the torch level and extra logic like bab21d20eb/torch/_export/pass_base.py (L201-L204C17) would need to be implemented.
* The graph signature currently becomes incorrect after adding runtime assertions (this doesn't seem to break logic since we already depend on positions instead of FQNs of outputs). This PR also fixes this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103887
Approved by: https://github.com/avikchaudhuri, https://github.com/tugsbayasgalan
2023-06-27 18:14:29 +00:00
936cd4f2f5 Migrate exportdb to torch.export (#104260)
Reapply of (https://github.com/pytorch/pytorch/pull/103861). Things that needed to be fixed:

- Fix a bug with returning dict output type
- Make pass_base work with map implementation
- Fix subtle bug with dynamo not propagating "val" in node.meta
- Add export_constraints field in ExportCase in ExportDB

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104260
Approved by: https://github.com/angelayi
2023-06-27 17:49:18 +00:00
ab9577087a Update accuracy for dynamo/torchbench CI - vision_maskrcnn, hf_T5_generate and dlrm (#104263)
Fixes breaking CI jobs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104263
Approved by: https://github.com/atalman, https://github.com/seemethere
2023-06-27 17:33:01 +00:00
ef285faeba [ET][XNNPACK] Add support for quantized Multiply (#104134)
Summary:
Also adds support for backend_config with relu fusion since XNNPACK allows it.

We should revisit the relu fusion once we gain more clarity on quantSrcPartition or some other way to do these fusions without having to add all combinations.

TODO: we should rename the backend config to et_xnnpack.py or something similar.

Test Plan: `buck test fbcode//mode/dev-nosan fbcode//executorch/backends/xnnpack/test:`

Differential Revision: D46985169

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104134
Approved by: https://github.com/mcr229, https://github.com/salilsdesai
2023-06-27 16:59:28 +00:00
80ea3422f0 [ROCm] Enable tl.reduce usage on ROCm (#104099)
Reverts the explicit aten.prod fallback on ROCm and enables the use of tl.reduce in Triton codegen. This PR also enables an optimisation that was previously conditionalised out for ROCm: https://github.com/pytorch/pytorch/pull/102444

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104099
Approved by: https://github.com/peterbell10, https://github.com/malfet
2023-06-27 16:21:32 +00:00
99e87bb6a0 [MPS] Dispatch outer bin edges selection function (#101792)
Dispatch the selection function to prevent using `is_mps()` in `Histogram.cpp`.

<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at b329a02</samp>

This pull request refactors and implements the logic for inferring the bin edges of histograms from the input tensor for different device types. It introduces a dispatch stub `histogram_select_outer_bin_edges_stub` and moves the device-specific code to separate files, such as `HistogramKernel.cpp` and `HistogramKernel.mm`. This improves the modularity and readability of the histogram functions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101792
Approved by: https://github.com/albanD
2023-06-27 16:17:10 +00:00
217a8b4697 [MPS] Add MPSProfiler to histogram kernel (#101692)
Apart from introducing MPSProfiler, this PR also
1. removes the synchronization call after all the commands are encoded, since the stream will be synchronized when the next graph op is encountered and run. One can take a look at this [PR](https://github.com/pytorch/pytorch/pull/99810) to get some insight.
2. initializes the offset calculation kernel's thread output with 0 to ensure the subsequent offset accumulation is correct. This change aligns the kernel with the `kernel_index_offsets` kernel.

<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 4094984</samp>

This change enables performance analysis of the `histogram` kernel on MPS devices by using the `MPSProfiler` class to collect and report relevant metrics. It modifies the file `HistogramKernel.mm` to add profiling calls around the kernel execution.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101692
Approved by: https://github.com/albanD
2023-06-27 16:17:10 +00:00
c40f5edf7b Change tools search order (#104214)
Prevents the following cryptic error if one attempts to use `run_test.py` on a system that also has torchaudio installed in dev mode (as `tools` from https://github.com/pytorch/audio might take precedence, but this is not how the script should behave):
```
Unable to import test_selections from tools/testing. Running without test selection stats.... Reason: No module named 'tools.stats'
Traceback (most recent call last):
  File "/Users/nshulga/git/pytorch/pytorch/test/run_test.py", line 1673, in <module>
    main()
  File "/Users/nshulga/git/pytorch/pytorch/test/run_test.py", line 1604, in main
    selected_tests = get_selected_tests(options)
  File "/Users/nshulga/git/pytorch/pytorch/test/run_test.py", line 1418, in get_selected_tests
    path = os.path.join(str(REPO_ROOT), TEST_TIMES_FILE)
NameError: name 'TEST_TIMES_FILE' is not defined
```

But make sure to remove it in the end, otherwise it will not work when torch is installed from a wheel but tests are run from a clean repo checkout.
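
The gist of the fix, sketched below (illustrative only; `REPO_ROOT` and the imported module mirror the error message above, and the real change is in `test/run_test.py`):

```python
import os
import sys

REPO_ROOT = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))

# Put the repo checkout first so its `tools` package wins over any other
# installed `tools` (e.g. torchaudio's) already on sys.path
sys.path.insert(0, REPO_ROOT)
from tools.testing import test_selections  # noqa: E402

# ... use test_selections ...

# Remove it again afterwards, otherwise running tests from a clean repo checkout
# against a wheel-installed torch breaks
sys.path.remove(REPO_ROOT)
```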

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at dd52521</samp>

> _Sing, O Muse, of the cunning code review_
> _That fixed the tests of the `tools` module_
> _By adding and removing the root path_
> _As a shepherd guides his flock to and fro._
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104214
Approved by: https://github.com/kit1980
2023-06-27 15:54:34 +00:00
4d613b9a5f [doc] Improve mps package description (#104184)
Fixes #104183

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104184
Approved by: https://github.com/malfet
2023-06-27 15:50:36 +00:00
ad2905ad27 Make _test_autograd_multiple_dispatch_view a view operation (#104149)
Fixes the `test_view_copy_cuda` failure case in https://github.com/pytorch/pytorch/issues/99655

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104149
Approved by: https://github.com/soulitzer
2023-06-27 15:43:35 +00:00
567b5e5b28 Multioutput backward formula: allow conditional guards against saving (#103750)
Multi-output backward formulas break the ability of autogen to decide which variables have to be stored in a graph.
This PR introduces a macro `wrap_opt_if` which could be used to hint autogen about variable interdependence.

For example, the following code is being generated for `_trilinear` with this modification:
```
at::Tensor _trilinear(c10::DispatchKeySet ks, const at::Tensor & i1, const at::Tensor & i2, const at::Tensor & i3, at::IntArrayRef expand1, at::IntArrayRef expand2, at::IntArrayRef expand3, at::IntArrayRef sumdim, int64_t unroll_dim) {
  auto& i1_ = unpack(i1, "i1", 0);
  auto& i2_ = unpack(i2, "i2", 1);
  auto& i3_ = unpack(i3, "i3", 2);
  [[maybe_unused]] auto _any_requires_grad = compute_requires_grad( i1, i2, i3 );

  [[maybe_unused]] auto _any_has_forward_grad_result = (isFwGradDefined(i1) || isFwGradDefined(i2) || isFwGradDefined(i3));
  std::shared_ptr<TrilinearBackward0> grad_fn;
  if (_any_requires_grad) {
    grad_fn = std::shared_ptr<TrilinearBackward0>(new TrilinearBackward0(), deleteNode);
    grad_fn->set_next_edges(collect_next_edges( i1, i2, i3 ));
    grad_fn->expand1 = expand1.vec();
    grad_fn->expand2 = expand2.vec();
    grad_fn->expand3 = expand3.vec();
    if (grad_fn->should_compute_output(1) || grad_fn->should_compute_output(2)) {
      grad_fn->i1_ = SavedVariable(i1, false);
    }
    if (grad_fn->should_compute_output(0) || grad_fn->should_compute_output(2)) {
      grad_fn->i2_ = SavedVariable(i2, false);
    }
    if (grad_fn->should_compute_output(0) || grad_fn->should_compute_output(1)) {
      grad_fn->i3_ = SavedVariable(i3, false);
    }
    grad_fn->sumdim = sumdim.vec();
  }

```

with the following backward modifications:
```
 - name: _trilinear(Tensor i1, Tensor i2, Tensor i3, int[] expand1, int[] expand2, int[] expand3, int[] sumdim, int unroll_dim=1) -> Tensor
  - i1, i2, i3: _trilinear_backward(grad, i1, i2, i3, expand1, expand2, expand3, sumdim, grad_input_mask)
  + i1, i2, i3: "_trilinear_backward(grad,
  +             wrap_opt_if(i1, grad_input_mask[1] || grad_input_mask[2]),
  +             wrap_opt_if(i2, grad_input_mask[0] || grad_input_mask[2]),
  +             wrap_opt_if(i3, grad_input_mask[0] || grad_input_mask[1]),
  +             expand1, expand2, expand3, sumdim, grad_input_mask)"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103750
Approved by: https://github.com/soulitzer
2023-06-27 15:12:09 +00:00
18dacf7e79 [Specialized Kernel] Update yaml syntax to use kernel instead of dispatch (#104070)
Based on this [code search](https://fburl.com/code/gjcnw8ly) (*.yaml with `dispatch: CPU:`), update all files found to use

```
kernels:
    - arg_meta: None
      kernel_name:
```
instead of
```
dispatch:
    CPU:
```
---
## Code changes:

- `fbcode/executorch/codegen/tools/gen_oplist.py`
  - Strip ET specific fields prior to calling parse_native_yaml_struct
---
## Files edited that are not `*functions.yaml` or `custom_ops.yaml`

- fbcode/executorch/kernels/optimized/optimized.yaml
- fbcode/executorch/kernels/quantized/quantized.yaml
- fbcode/executorch/kernels/test/custom_kernel_example/my_functions.yaml

---
## Found Files that were not edited

**Dispatched to more than just CPU**
- fbcode/caffe2/aten/src/ATen/native/native_functions.yaml
- xplat/caffe2/aten/src/ATen/native/native_functions.yaml
- xros/third-party/caffe2/caffe2/aten/src/ATen/native/native_functions.yaml

**Grouped ops.yaml path**
- fbcode/on_device_ai/Assistant/Jarvis/min_runtime/operators/ops.yaml

---
**Design Doc:** https://docs.google.com/document/d/1gq4Wz2R6verKJ2EFseLyPdAF0wqomnCrVDDJpRkYsRw/edit?kh_source=GDOCS#heading=h.8raqyft9y50

Differential Revision: [D46952067](https://our.internmc.facebook.com/intern/diff/D46952067/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D46952067/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104070
Approved by: https://github.com/larryliu0820
2023-06-27 09:53:20 +00:00
95707ac964 [fake_pg] allow fake_pg allgather to do some simple validation (#104213)
Note that in general it's not good form to try to make FakePG work with 'real data',
but the reasoning here is that we want FakePG to work with DeviceMesh's init code
that has data validation, which makes it worth the tradeoff.

In general, users should use MTPG or a normal PG for cases where they care about
real data from collectives.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104213
Approved by: https://github.com/wconstab, https://github.com/voznesenskym
2023-06-27 09:39:16 +00:00
6c1ccccf21 Enable mimalloc on pytorch Windows (#102595)
This PR is the implementation of [#102534](https://github.com/pytorch/pytorch/issues/102534), option 2.
Major changes:
1. Add mimalloc as a submodule.
2. Add build option "USE_MIMALLOC".
3. It is only enabled in the Windows build, and it improves PyTorch memory allocation performance.

Additional Test:
<img width="953" alt="image" src="https://github.com/pytorch/pytorch/assets/8433590/4b2ec2dc-16f1-4ad9-b457-cfeb37e489d3">
This PR also builds and statically links mimalloc on Linux.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102595
Approved by: https://github.com/jgong5, https://github.com/malfet
2023-06-27 08:53:26 +00:00
803c14490b Specialize storage_offset - Does not cover automatic dynamic (#104204)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104204
Approved by: https://github.com/wconstab
2023-06-27 05:51:42 +00:00
c3e4a67905 Refactor multigpu tests to test_cuda_multigpu (#104059)
Mostly a refactor that moves all the tests from `test_cuda` that benefit from a multi-GPU environment into their own file.

- Add `TestCudaMallocAsync` class for Async tests ( to separate them from `TestCudaComm`)
- Move individual tests from `TestCuda` to `TestCudaMultiGPU`
- Move `_create_scaling_models_optimizers` and `_create_scaling_case` to `torch.testing._internal.common_cuda`
- Add newly created `test_cuda_multigpu` to the multigpu periodic test

<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at f4d46fa</samp>

This pull request fixes a flaky test and improves the testing of gradient scaling on multiple GPUs. It adds verbose output for two CUDA tests, and refactors some common code into helper functions in `torch/testing/_internal/common_cuda.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104059
Approved by: https://github.com/huydhn
2023-06-27 05:32:05 +00:00
572ff2779b [RESUBMIT] Ensure ncclCommAbort can abort stuck ncclCommInitRank (#103925)
https://github.com/pytorch/pytorch/pull/95715 added the functionality to abort `ncclCommInitRankConfig` by specifying `blocking=0` to enable non-blocking behavior.

However, calling `pg._abort()` didn't recover from a stuck `ncclCommInitRankConfig`, since the `_abort` method only looked through the `devNCCLCommMap_` map and aborted those communicators. Since `ncclCommInitRankConfig` was stuck, the communicator itself wasn't added to the map and the host thread was stuck on this line: https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L1171. As a result, `_abort` was a no-op.

To resolve this issue, I added the communicators to `inProgressCommMap_` as soon as they were created and then removed them once added to `devNCCLCommMap_`.

I also added a unit test that was failing without the changes to ProcessGroupNCCL.cpp
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103925
Approved by: https://github.com/osalpekar
2023-06-27 04:22:03 +00:00
b76a040b18 Revert "[core][pruning][sparse][feature] SparseSemiStructured tensor subclass (#102135)"
This reverts commit aea771de30427998e83010459b69da1ab66f0879.

Reverted https://github.com/pytorch/pytorch/pull/102135 on behalf of https://github.com/huydhn due to test_sparse_semi_structured.py::TestSparseSemiStructuredCUDA::test_mm_sparse_first_NT_cuda_int8 is still failing CUDA trunk jobs aea771de30 ([comment](https://github.com/pytorch/pytorch/pull/102135#issuecomment-1608744110))
2023-06-27 03:49:31 +00:00
7157dfdd4a [jit] fix duplicated module input and output values in tracing module (#102510)
The remap should record the original input pointers instead of the remapped ones.

Test case:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Normalize(nn.Module):
    def __init__(self):
        super().__init__()

        self.norm = nn.GroupNorm(num_groups=32, num_channels=32)

    def forward(self, x, y):
        if y is None:
            y = x
        else:
            y = self.norm(y)

        y = y * 2

        return y

class G(nn.Module):
    def __init__(self):
        super().__init__()

        self.norm = Normalize()

    def forward(self, x):

        A = self.norm(x, None)
        B = F.relu(A)

        return A, B

class Net(nn.Module):
    def __init__(self):
        super().__init__()

        self.g = G()

        self.norm_1 = Normalize()

    def forward(self, x):
        hs = self.g(x)

        A, B = hs

        h = self.norm_1(B, A)
        return h

net = Net()
net = net.eval()

x = torch.randn(1, 32, 16, 16)

traced = torch.jit.trace(net, x)

print(traced.graph)
```

Without this patch, there are duplicated lifted values: %80, %81, %82, %83, %84, %85.
```
graph(%self.1 : __torch__.Net,
      %x : Float(1, 32, 16, 16, strides=[8192, 256, 16, 1], requires_grad=0, device=cpu)):
  %norm_1 : __torch__.___torch_mangle_1.Normalize = prim::GetAttr[name="norm_1"](%self.1)
  %g : __torch__.G = prim::GetAttr[name="g"](%self.1)
  %86 : (Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor) = prim::CallMethod[name="forward"](%g, %x)
  %79 : Float(1, 32, 16, 16, strides=[8192, 256, 16, 1], requires_grad=0, device=cpu), %80 : Float(1, 32, 16, 16, strides=[8192, 256, 16, 1], requires_grad=0, device=cpu), %81 : Float(1, 32, 16, 16, strides=[8192, 256, 16, 1], requires_grad=0, device=cpu), %82 : Float(1, 32, 16, 16, strides=[8192, 256, 16, 1], requires_grad=0, device=cpu), %83 : Float(1, 32, 16, 16, strides=[8192, 256, 16, 1], requires_grad=0, device=cpu), %84 : Float(1, 32, 16, 16, strides=[8192, 256, 16, 1], requires_grad=0, device=cpu), %85 : Float(1, 32, 16, 16, strides=[8192, 256, 16, 1], requires_grad=0, device=cpu) = prim::TupleUnpack(%86)
  %87 : Tensor = prim::CallMethod[name="forward"](%norm_1, %79, %80, %81, %82, %83, %84, %85)
  return (%87)

```

With this patch:
```
graph(%self.1 : __torch__.Net,
      %x : Float(1, 32, 16, 16, strides=[8192, 256, 16, 1], requires_grad=0, device=cpu)):
  %norm_1 : __torch__.___torch_mangle_1.Normalize = prim::GetAttr[name="norm_1"](%self.1)
  %g : __torch__.G = prim::GetAttr[name="g"](%self.1)
  %71 : Tensor = prim::CallMethod[name="forward"](%g, %x)
  %72 : Tensor = prim::CallMethod[name="forward"](%norm_1, %71)
  return (%72)

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102510
Approved by: https://github.com/davidberard98
2023-06-27 03:43:06 +00:00
aea771de30 [core][pruning][sparse][feature] SparseSemiStructured tensor subclass (#102135)
This PR adds in support for semi-structured sparsity via a tensor
subclass. It currently uses the CUTLASS kernels merged in PR #100881.

In the future we plan to add in cuSPARSELt support (see the other PRs in
the stack), which will give us larger performance gains.

This PR adds in 2 things:
- a Tensor subclass, `SparseSemiStructuredTensor` to store the
  sparse tensor in compressed form and override `__torch_dispatch__`.
- a conversion function that takes in a dense tensor and a
  semi-structured sparse bool mask and creates an instance of the
  subclass.

**SparseSemiStructuredTensor**

The subclass stores the dense tensor in a contiguous flattened tensor
for future compatibility with cuSPARSELt, which expects this format.
Note that the CUTLASS kernels do not have this limitation, as the
specified values and the metadata are passed separately in
`_structured_sparse_linear`. In the future we can use the cuSPARSELt bindings
[here](https://github.com/pytorch/pytorch/pull/103700) for faster matmul, better dtype coverage, and relaxed shape
constraints.

Since we currently don't have a way to go back from the sparse
representation to the dense representation, and we store the weights in
compressed form, we don't have a great way to handle .t().

Instead, we keep track of how often we've called transpose on our
tensor, and if it's an unexpected number we throw an error. When the first
argument is sparse, we expect an even number of calls to transpose,
while when the second argument is sparse, we expect an odd number of
calls. This is because we support second argument sparse matrix
multiplications by using transpose properties.

**to_sparse_semi_structured**

This is a conversion function to convert a dense tensor and a
semi-structured sparse bool mask into a subclass. Currently, we must
pass in a bool mask, since we can't infer it because there may be
additional zero elements in the dense tensor, so `tensor != 0` is not necessarily 2:4
sparse.

Once we add either a method to derive the mask from the dense tensor or
cuSPARSELt, we no longer need to pass in the mask. cuSPARSELt has its
own helper functions to create the metadata mask.

**User Details**

We have implemented support for the following ops for `torch.float16`
and `torch.int8`:
```
torch.addmm(bias, dense, sparse.t())
torch.mm(dense, sparse)
torch.mm(sparse, dense)
aten.linear.default
aten.t.default
aten.t.detach
```

The end user interface to accelerate an nn.Linear module with the
subclass would look like this:

```
from torch.sparse import to_sparse_semi_structured

mask = torch.Tensor([0, 0, 1, 1]).tile(128, 32).cuda().bool()
linear = Model(128, 128).half().cuda()

linear.weight = nn.Parameter(to_sparse_semi_structured(linear.weight,
                                                       mask=linear.weight.bool()))

```

This also updates tests and the `torch.sparse` module docstring to
reflect these changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102135
Approved by: https://github.com/albanD
2023-06-27 02:37:00 +00:00
968b7b5e0f Initial commit of collective_utils (#101037)
Summary:
Details in T133020932
First commit of the collective utils library. Ported over from model store; removed scuba logging, error_trait, and all dependencies on modelstore.

Test Plan: In the following diffs.

Differential Revision: D45545970

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101037
Approved by: https://github.com/H-Huang
2023-06-27 02:15:16 +00:00
41866a2ead Fix missing mandatory device_type argument in autocast docstring (#97223)
Fixes #[92803](https://github.com/pytorch/pytorch/issues/92803)
![Screenshot from 2023-03-21 12-28-14](https://user-images.githubusercontent.com/100136654/226538769-141f3b9e-0de2-4e86-8e42-d5a4a7413c6f.png)
![Screenshot from 2023-03-21 12-28-29](https://user-images.githubusercontent.com/100136654/226538777-9e719090-75c0-46f7-8594-5efcb0a46df6.png)
![Screenshot from 2023-03-21 12-29-36](https://user-images.githubusercontent.com/100136654/226538783-399a0e60-ffc9-4d73-801c-8cfce366d142.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97223
Approved by: https://github.com/albanD, https://github.com/malfet
2023-06-27 01:54:54 +00:00
6d2da6106d Raise AttributeError in _OpsNamespace if __self__ attribute is requested (#104096)
Summary:
Trying to get the `__self__` attribute on any `_OpNamespace` object should be an invalid operation. The `__self__` attribute only exists on instance method objects and not on class objects.

In [dynamo](a152b3e3b8/torch/_dynamo/variables/torch.py (L164)) there is code that tries to access the `__self__` attribute on `TorchVariable`; this currently results in an expensive call to `torch._C._jit_get_operation` [here](a152b3e3b8/torch/_ops.py (L740)) which ultimately fails and throws an exception. For cases where it fails, the operation turns out to be quite expensive, on the order of ~0.03s.

For edge use cases, when exporting large models with quantized ops this exception is thrown hundreds of times, wasting a lot of time. By preventing the call to `torch._C._jit_get_operation` we can return quickly from this function and significantly reduce export times. On a large ASR model, for example, export currently takes **~405** seconds; with this change we can reduce it to **~340s**.

Overall this should also be a harmless change, as virtually no one should ever try to access the `__self__` attribute on an `_OpNamespace` object.
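
A sketch of the kind of guard described above (illustrative only; the real change lives in `torch/_ops.py` and may differ in detail):

```python
class _OpNamespace:  # simplified stand-in for torch._ops._OpNamespace
    def __init__(self, name):
        self.name = name

    def __getattr__(self, op_name):
        # Bail out early for attributes that can never be ops; in particular
        # `__self__` would otherwise trigger an expensive (and ultimately
        # failing) operator lookup.
        if op_name == "__self__":
            raise AttributeError(
                f"Invalid attribute '{op_name}' for '_OpNamespace' '{self.name}'"
            )
        # ... fall through to the (expensive) operator lookup ...
        raise AttributeError(f"'_OpNamespace' object has no attribute '{op_name}'")
```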

Test Plan: Added test case.

Differential Revision: D46959879

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104096
Approved by: https://github.com/larryliu0820, https://github.com/ezyang, https://github.com/zou3519
2023-06-27 01:42:06 +00:00
f8ac569365 [Inductor][Quant]Fix tile2d code generation issue with uint8 data type (#104074)
**Summary**
The previous vectorized tile2d code generation doesn't support the uint8 input data type: it still treats the input as float and generates wrong results. This PR fixes this issue. Take the UT `test_tile2d_load_decomposed_dequant_add_relu_quant` in this PR as an example.
The previously generated code is:
```
#pragma GCC ivdep
for(long i1=static_cast<long>(0L); i1<static_cast<long>(192L); i1+=static_cast<long>(16L))
{
    unsigned char tmp0[16*16] __attribute__ ((aligned (16)));
    at::vec::transpose_mxn<unsigned char,16,16>(in_ptr0 + static_cast<long>(i0 + (1024L*i1)), static_cast<long>(1024L), tmp0, 16);
    unsigned char tmp7[16*16] __attribute__ ((aligned (16)));
    at::vec::transpose_mxn<unsigned char,16,16>(in_ptr1 + static_cast<long>(i0 + (1024L*i1)), static_cast<long>(1024L), tmp7, 16);
    for (long i0_inner = 0; i0_inner < 16; i0_inner++)
    {
        auto tmp1 = at::vec::Vectorized<float>::loadu(tmp0 + static_cast<long>(16L*i0_inner));
        auto tmp8 = at::vec::Vectorized<float>::loadu(tmp7 + static_cast<long>(16L*i0_inner));
        auto tmp2 = (tmp1);
        auto tmp3 = at::vec::Vectorized<float>(static_cast<float>(1.0));
        auto tmp4 = tmp2 - tmp3;
        auto tmp5 = at::vec::Vectorized<float>(static_cast<float>(0.01));
        auto tmp6 = tmp4 * tmp5;
        auto tmp9 = (tmp8);
        auto tmp10 = at::vec::Vectorized<float>(static_cast<float>(2.0));
        auto tmp11 = tmp9 - tmp10;
        auto tmp12 = at::vec::Vectorized<float>(static_cast<float>(0.02));
        auto tmp13 = tmp11 * tmp12;
        auto tmp14 = tmp6 + tmp13;
        auto tmp15 = at::vec::clamp_min(tmp14, decltype(tmp14)(0));
        auto tmp16 = at::vec::Vectorized<float>(static_cast<float>(33.333333333333336));
        auto tmp17 = tmp15 * tmp16;
        auto tmp18 = tmp17.round();
        auto tmp19 = at::vec::Vectorized<float>(static_cast<float>(3.0));
        auto tmp20 = tmp18 + tmp19;
        auto tmp21 = at::vec::Vectorized<float>(static_cast<float>(0.0));
        auto tmp22 = at::vec::maximum(tmp20, tmp21);
        auto tmp23 = at::vec::Vectorized<float>(static_cast<float>(255.0));
        auto tmp24 = at::vec::minimum(tmp22, tmp23);
        auto tmp25 = (tmp24);
        at::vec::store_float_as_uint8(tmp25, out_ptr0 + static_cast<long>(i1 + (196L*i0) + (196L*i0_inner)));
    }
}
```

After this PR, the generated code is:
```
#pragma GCC ivdep
for(long i1=static_cast<long>(0L); i1<static_cast<long>(192L); i1+=static_cast<long>(16L))
{
    unsigned char tmp0[16*16] __attribute__ ((aligned (16)));
    at::vec::transpose_mxn<unsigned char,16,16>(in_ptr0 + static_cast<long>(i0 + (1024L*i1)), static_cast<long>(1024L), tmp0, 16);
    unsigned char tmp7[16*16] __attribute__ ((aligned (16)));
    at::vec::transpose_mxn<unsigned char,16,16>(in_ptr1 + static_cast<long>(i0 + (1024L*i1)), static_cast<long>(1024L), tmp7, 16);
    for (long i0_inner = 0; i0_inner < 16; i0_inner++)
    {
        auto tmp1 = at::vec::load_uint8_as_float(tmp0 + static_cast<long>(16L*i0_inner));
        auto tmp8 = at::vec::load_uint8_as_float(tmp7 + static_cast<long>(16L*i0_inner));
        auto tmp2 = (tmp1);
        auto tmp3 = at::vec::Vectorized<float>(static_cast<float>(1.0));
        auto tmp4 = tmp2 - tmp3;
        auto tmp5 = at::vec::Vectorized<float>(static_cast<float>(0.01));
        auto tmp6 = tmp4 * tmp5;
        auto tmp9 = (tmp8);
        auto tmp10 = at::vec::Vectorized<float>(static_cast<float>(2.0));
        auto tmp11 = tmp9 - tmp10;
        auto tmp12 = at::vec::Vectorized<float>(static_cast<float>(0.02));
        auto tmp13 = tmp11 * tmp12;
        auto tmp14 = tmp6 + tmp13;
        auto tmp15 = at::vec::clamp_min(tmp14, decltype(tmp14)(0));
        auto tmp16 = at::vec::Vectorized<float>(static_cast<float>(33.333333333333336));
        auto tmp17 = tmp15 * tmp16;
        auto tmp18 = tmp17.round();
        auto tmp19 = at::vec::Vectorized<float>(static_cast<float>(3.0));
        auto tmp20 = tmp18 + tmp19;
        auto tmp21 = at::vec::Vectorized<float>(static_cast<float>(0.0));
        auto tmp22 = at::vec::maximum(tmp20, tmp21);
        auto tmp23 = at::vec::Vectorized<float>(static_cast<float>(255.0));
        auto tmp24 = at::vec::minimum(tmp22, tmp23);
        auto tmp25 = (tmp24);
        at::vec::store_float_as_uint8(tmp25, out_ptr0 + static_cast<long>(i1 + (196L*i0) + (196L*i0_inner)));
    }
}
```

**Test Plan**
```
python -m pytest test_cpu_repro.py -k test_tile2d_load_decomposed_dequant_add_relu_quant
python -m pytest test_cpu_repro.py -k test_tile2d_store_channel_shuffle_cl_quant_output
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104074
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-06-27 00:59:05 +00:00
d2281e38ae Adds the initial support for AOTInductor model and interface (#104202)
This PR combines the C++ code for the AOTInductor's model and interface with Bin Bao's changes to AOTInductor codegen.

It adds a number of AOTInductor C interfaces that can be used by an inference runtime. Under the hood of the interfaces, the model code generated by the AOTInductor codegen is wrapped into a class, AOTInductorModel, which manages tensors and runs the model inference.

On top of AOTInductorModel, we provide one more abstract layer, AOTInductorModelContainer, which allows the user to have multiple inference runs concurrently for the same model.

This PR also adjusts the compilation options for AOT codegen, particularly some fbcode-related changes such as libs to be linked and header-file search paths.

Note that this is the very first version of the AOTInductor model and interface, so many features (e.g. dynamic shapes) are incomplete. We will support those missing features in future PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104202
Approved by: https://github.com/desertfire
2023-06-27 00:37:26 +00:00
d8a2e7461b Fix incorrect distribution of randperm with device mps (#104171)
Fixes #104170

As noted in the above issue it seems that the code for randperm basically boils down to:
`torch.argsort(torch.rand(size, device="mps"), dim = 0)`

However, it seems that in the fused(?) PyTorch version, the tensor we were drawing `torch.rand(size, device="mps")` from was int64 with an inclusive(?) upper bound of 1. This caused everything to be sorted into two groups (depending on whether you drew 0 or 1), each monotonically ascending due to sort tie-breaking.

One way to fix this is to just generate the random tensor as float64s with an upper bound of 1.0 instead of int64s. An alternative is to just set the upper bound to int64 max.

~I chose the float64 one basically on a coin flip b/c I couldn't tell the original contributor's intent (due to the mixed-up upper bound and type), but would be happy to change to use int64 with int64 max as the upper bound instead if that's better.~

Edit: on second thought, I don't like using floats from 0.0 to 1.0, as there are fewer of them in that range than int64s from 0 to int64 max. I also suspect integer math might be faster, but I need to benchmark this tomorrow.
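
For reference, a quick sanity check of the behavior described above (a sketch, assuming an MPS device is available):

```python
import torch

n, trials = 8, 2000
counts = torch.zeros(n, n)  # counts[i, v]: how often value v lands at position i
for _ in range(trials):
    perm = torch.randperm(n, device="mps").cpu()
    counts[torch.arange(n), perm] += 1

# With a correct implementation every entry hovers around trials / n;
# the buggy version concentrates values into two monotonically ascending runs.
print(counts / trials)
```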
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104171
Approved by: https://github.com/malfet
2023-06-27 00:36:15 +00:00
994b98b78b Add language server support for vscode (#104160)
Makes it so clangd support can work with VS Code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104160
Approved by: https://github.com/seemethere
2023-06-27 00:20:53 +00:00
981f24e806 Add docstring to torch.serialization.register_package (#104046)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104046
Approved by: https://github.com/albanD
2023-06-26 23:28:32 +00:00
4a008d268a REDO of dropout support for mem eff #102038 (#103704)
THIS IS A new PR with the changes from #102038 + #103201, plus namespacing changes to fix a bug.

# Summary
This PR builds off of:
- https://github.com/pytorch/pytorch/pull/101847
- https://github.com/pytorch/pytorch/pull/100583

It specifically adds dropout support to the memory efficient attention kernel. In the process of doing so roughly 3 changes were made:
- Update sdpa dispatching to allow for inputs requiring grad to be sent to efficient attention
- Update how memory efficient attention handles passing the rng state from forward to backward in order to enable cuda_graph support
- Fix a bug in the kernel that was causing incorrect gradients to be produced for num_keys > 64 with dropout and causal masking set. https://github.com/facebookresearch/xformers/pull/755
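
A minimal way to exercise the new path (a sketch; it assumes a CUDA build and uses the `torch.backends.cuda.sdp_kernel` context manager to force the memory-efficient backend, with illustrative shapes and dropout probability):

```python
import torch
import torch.nn.functional as F

q, k, v = (torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16,
                       requires_grad=True)
           for _ in range(3))

# Force the memory-efficient kernel; dropout_p > 0 with grads now dispatches here
with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_math=False,
                                    enable_mem_efficient=True):
    out = F.scaled_dot_product_attention(q, k, v, dropout_p=0.1, is_causal=True)
out.sum().backward()
```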

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103704
Approved by: https://github.com/cpuhrsch
2023-06-26 23:05:03 +00:00
bfa08a1c67 Revert "[core][pruning][sparse][feature] SparseSemiStructured tensor subclass (#102135)"
This reverts commit cf5262a84f815c1e574883bc244333d0d211c7a2.

Reverted https://github.com/pytorch/pytorch/pull/102135 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but test_sparse_semi_structured.py::TestSparseSemiStructuredCUDA::test_mm_sparse_first_NT_cuda_int8 is failing CUDA trunk jobs cf5262a84f. This looks like a landrace ([comment](https://github.com/pytorch/pytorch/pull/102135#issuecomment-1608423849))
2023-06-26 22:54:16 +00:00
cyy
d4a98280a8 [Reland] Use missing-prototypes in torch_cpu (#104138)
This PR enables Wmissing-prototypes in torch_cpu except some generated cpp files and the mps and metal,vulkan backends and caffe2 sources.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104138
Approved by: https://github.com/albanD, https://github.com/malfet
2023-06-26 22:53:43 +00:00
436d035dc7 Revert "DDP + C10D sparse all_reduce changes (#103916)"
This reverts commit fed5fba6e4ee3f221bac481798c5a31f785ba75e.

Reverted https://github.com/pytorch/pytorch/pull/103916 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/103916#issuecomment-1608412325))
2023-06-26 22:37:58 +00:00
a69f427f95 aten: Ensure dim is size_t (#104201)
Attempts to fix failures introduced in https://github.com/pytorch/pytorch/pull/103930 (example failures: https://github.com/pytorch/pytorch/actions/runs/5363450214/jobs/9731034104)

<!--
copilot:all
-->
### <samp>🤖 Generated by Copilot at 67d5076</samp>

### Summary
🔧🚨🚦

<!--
1.  🔧 (wrench) - This emoji can be used to indicate a bug fix or a minor improvement to the code quality or performance.
2.  🚨 (rotating light) - This emoji can be used to indicate a change that affects the error handling or validation logic of the code, or that adds or modifies a test case.
3.  🚦 (vertical traffic light) - This emoji can be used to indicate a change that affects the control flow or branching logic of the code, or that adds or modifies a condition or assertion.
-->
Fix a compiler warning in `Expand.cpp` by casting a tensor dimension to `size_t`. This improves the code quality and correctness of the `expand` function for the Vulkan backend.

> _`expand` tensor_
> _cast `dim()` to `size_t`_
> _autumn leaves warning_

### Walkthrough
*  Cast `self.dim()` to `size_t` to avoid signed-unsigned comparison warning in `expand` function ([link](https://github.com/pytorch/pytorch/pull/104201/files?diff=unified&w=0#diff-c175e908cbcb8595b22696e672b526202ed3a4a11341603c1522397e499b5c2bL29-R29))

<details>
<summary> Fix done using chatgpt </summary>

![Screenshot 2023-06-26 at 11 52 14 AM](https://github.com/pytorch/pytorch/assets/1700823/95c141e5-36b6-4916-85ca-85415bcc507f)

</details>
Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104201
Approved by: https://github.com/lucylq, https://github.com/huydhn, https://github.com/malfet
2023-06-26 22:01:27 +00:00
b93ed8164e Add non-recursive module.to_empty option (#104197)
Fixes https://github.com/pytorch/pytorch/issues/97049, related to https://github.com/pytorch/pytorch/issues/104187
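
A sketch of how the non-recursive option might be used (assuming the new behavior is exposed as a `recurse` keyword argument, per the PR title; the module here is illustrative only):

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.scale = nn.Parameter(torch.empty(4, device="meta"))  # direct parameter
        self.proj = nn.Linear(4, 4, device="meta")                # child module

block = Block()
# Materialize only `block`'s own parameters/buffers; `block.proj` stays on meta
# and can be initialized separately.
block.to_empty(device="cpu", recurse=False)
print(block.scale.device, block.proj.weight.device)  # cpu meta
```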

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104197
Approved by: https://github.com/albanD
2023-06-26 21:47:22 +00:00
cf5262a84f [core][pruning][sparse][feature] SparseSemiStructured tensor subclass (#102135)
This PR adds in support for semi-structured sparsity via a tensor
subclass. It currently uses the CUTLASS kernels merged in PR #100881.

In the future we plan to add in cuSPARSELt support (see the other PRs in
the stack), which will give us larger performance gains.

This PR adds in 2 things:
- a Tensor subclass, `SparseSemiStructuredTensor` to store the
  sparse tensor in compressed form and override `__torch_dispatch__`.
- a conversion function that takes in a dense tensor and a
  semi-structured sparse bool mask and creates an instance of the
  subclass.

**SparseSemiStructuredTensor**

The subclass stores the dense tensor in a contiguous flattened tensor
for future compatibility with cuSPARSELt, which expects this format.
Note that the CUTLASS kernels do not have this limitation, as the
specified values and the metadata are passed separately in
`_structured_sparse_linear`. In the future we can use the cuSPARSELt bindings
[here](https://github.com/pytorch/pytorch/pull/103700) for faster matmul, better dtype coverage, and relaxed shape
constraints.

Since we currently don't have a way to go back from the sparse
representation to the dense representation, and we store the weights in
compressed form, we don't have a great way to handle .t().

Instead, we keep track of how often we've called transpose on our
tensor, and if it's an unexpected number we throw an error. When the first
argument is sparse, we expect an even number of calls to transpose,
while when the second argument is sparse, we expect an odd number of
calls. This is because we support second argument sparse matrix
multiplications by using transpose properties.

**to_sparse_semi_structured**

This is a conversion function to convert a dense tensor and a
semi-structured sparse bool mask into a subclass. Currently, we must
pass in a bool mask, since we can't infer it because there may be
additional zero elements in the dense tensor, so `tensor != 0` is not necessarily 2:4
sparse.

Once we add either a method to derive the mask from the dense tensor or
cuSPARSELt, we no longer need to pass in the mask. cuSPARSELt has its
own helper functions to create the metadata mask.

**User Details**

We have implemented support for the following ops for `torch.float16`
and `torch.int8`:
```
torch.addmm(bias, dense, sparse.t())
torch.mm(dense, sparse)
torch.mm(sparse, dense)
aten.linear.default
aten.t.default
aten.t.detach
```

The end user interface to accelerate an nn.Linear module with the
subclass would look like this:

```
from torch.sparse import to_sparse_semi_structured

mask = torch.Tensor([0, 0, 1, 1]).tile(128, 32).cuda().bool()
linear = Model(128, 128).half().cuda()

linear.weight = nn.Parameter(to_sparse_semi_structured(linear.weight,
                                                       mask=linear.weight.bool()))

```

This also updates tests and the `torch.sparse` module docstring to
reflect these changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102135
Approved by: https://github.com/albanD
2023-06-26 21:30:43 +00:00
f7f415eb2d [inductor] add cpp randint implementation to ir.py (#103079) (#104124)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104124
Approved by: https://github.com/desertfire
2023-06-26 21:26:25 +00:00
fed5fba6e4 DDP + C10D sparse all_reduce changes (#103916)
Summary:
## Changes

Prototyping sparse allreduce using the sparse dispatch key. When passing sparse tensors into `dist.all_reduce()` we can execute our dispatched function.

Prior to this change, passing a sparse tensor into `all_reduce()` would error out with `Tensor must be dense...`

## Example script

```python
# python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 this_script.py

import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    a = torch.tensor([[0, 2.], [3, 0]]).to(rank)
    a = a.to_sparse()
    print(f"rank {rank} - a: {a}")
    dist.all_reduce(a)

if __name__ == "__main__":
    main()
```

output:
```
rank 1 - a: tensor(indices=tensor([[0, 1],
                       [1, 0]]),
       values=tensor([2., 3.]),
       device='cuda:1', size=(2, 2), nnz=2, layout=torch.sparse_coo)
allreduce_sparse_cuda_
tensor.is_sparse() = 1
in ProcessGroupNCCL::allreduceSparse
rank 0 - a: tensor(indices=tensor([[0, 1],
                       [1, 0]]),
       values=tensor([2., 3.]),
       device='cuda:0', size=(2, 2), nnz=2, layout=torch.sparse_coo)
allreduce_sparse_cuda_
tensor.is_sparse() = 1
in ProcessGroupNCCL::allreduceSparse
```

Test Plan:
Testing commands (OSS):

```
# python
pytest test/distributed/test_c10d_nccl.py -vsk test_sparse_allreduce_ops

# c++
build/bin/ProcessGroupNCCLTest --gtest_filter=ProcessGroupNCCLTest.testSparseAllreduce
```

Testing commands (internal, ondemand GPU):
ddp tests:
```
buck build mode/opt -c hpc_comms.use_nccl=exp //caffe2/test/distributed:c10d --show-full-output

# Get the .par file from the previous command and use it below
TORCH_SHOW_CPP_STACKTRACE=1 /data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/c8344b52091f4f7f/caffe2/test/distributed/__c10d__/c10d.par -r test_ddp_set_sparse_metadata
```

c10d tests:
```
# build tests and run with log output (python)
buck build mode/opt -c hpc_comms.use_nccl=exp //caffe2/test/distributed:c10d --show-full-output
NCCL_DEBUG=WARN /data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/c8344b52091f4f7f/caffe2/test/distributed/__c10d__/c10d.par -r test_sparse_allreduce_ops

# python
NCCL_DEBUG=WARN buck test mode/opt -c hpc_comms.use_nccl=exp //caffe2/test/distributed:c10d -- --exact 'caffe2/test/distributed:c10d - test_sparse_allreduce_ops (test_c10d_nccl.ProcessGroupNCCLTest)'

# c++
NCCL_DEBUG=WARN buck run mode/opt -c hpc_comms.use_nccl=exp //caffe2/test/cpp/c10d:ProcessGroupNCCLTest -- --gtest_filter=ProcessGroupNCCLTest.testSparseAllreduce
```

Differential Revision: D46724856

Pulled By: H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103916
Approved by: https://github.com/rohan-varma
2023-06-26 20:42:17 +00:00
8a08733218 update test_higher_order_op: grad test (#104179)
With https://github.com/pytorch/pytorch/pull/103597, `config.dynamic_shapes` is always `True` and we never check the generated graph.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104179
Approved by: https://github.com/zou3519
2023-06-26 19:32:59 +00:00
adf9595c2f Update CODEOWNERS (#103934)
Remove users that no longer have write access to the repo, resolving CODEOWNERS errors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103934
Approved by: https://github.com/ZainRizvi, https://github.com/atalman, https://github.com/malfet
2023-06-26 19:29:29 +00:00
fb8aa721e2 [Pytorch Edge][BE] Delete Sparse Qnnpack test failing since 2022 jul (#104073)
Summary:
According to https://www.internalfb.com/omh/view/ai_infra_mobile_platform/tests these have been failing since July 2022.

Just going to delete them unless someone thinks they actually do matter and should be made green.

https://www.internalfb.com/intern/test/562949996115570/ <- failing test

I ran locally and got errors like

  xplat/caffe2/aten/src/ATen/native/quantized/cpu/qnnpack/test/gemm-block-sparse-microkernel-tester.h:483: Failure
  Expected equality of these values:
  c[mIndex * cStride() + nIndex]
    Which is: -872.50446
  acc[mIndex * n() + nIndex]
    Which is: -872.50488
  at 0, 0: reference = -872.5048828125, optimized = -872.50445556640625, Mr x Nr = 8 x 4, M x N x K = 7 x 1 x 13
  xplat/caffe2/aten/src/ATen/native/quantized/cpu/qnnpack/test/gemm-block-sparse-microkernel-tester.h:483: Failure
  Expected equality of these values:
  c[mIndex * cStride() + nIndex]
    Which is: -67.246628
  acc[mIndex * n() + nIndex]
    Which is: -67.24707
  at 3, 0: reference = -67.2470703125, optimized = -67.246627807617188, Mr x Nr = 8 x 4, M x N x K = 4 x 1 x 15
  [  FAILED  ] Q8GEMM_8x4c1x4__SSE2.packedA_k_gt_8_subtile (148 ms)

Test Plan: ci

Reviewed By: kimishpatel

Differential Revision: D46950966

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104073
Approved by: https://github.com/kimishpatel
2023-06-26 18:27:20 +00:00
100aff9d4f [export] Deserialize subgraphs. (#103991)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103991
Approved by: https://github.com/angelayi, https://github.com/avikchaudhuri
2023-06-26 18:17:44 +00:00