140 Commits

Author SHA1 Message Date
d3ff2d42c2 [ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124)
Description:
1. Quantize Linear Layer Weights to 4-bits:
Quantize the weights of the Linear layer to 4 bits, using symmetric quantization.
Pack two 4-bit weights into one uint8 container.
Choose a quantization scheme (channel-wise or group-wise), with the group size being a multiple of 32.

2. Prepare Quantized Weights, Scales, and Optional Bias:
After quantizing, obtain the quantized_weights, scales, and groupsize.
If the original Linear layer has a bias, prepare it as well.

3. Pack the Weights Efficiently:
Use torch.ops.aten._dyn_quant_pack_4bit_weight to optimally pack the weights, scales, and optional bias.
```python
packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(weight, scales_and_zeros, bias, groupsize, in_features, out_features)
```
Input parameters should include:
in_features and out_features (the same as the Linear layer’s corresponding parameters).

4. Perform Dynamic Quantized Matrix Multiplication:
Use torch.ops.aten._dyn_quant_matmul_4bit to perform matrix multiplication with quantized weights.
```python
output = torch.ops.aten._dyn_quant_matmul_4bit(input, packed_weights,  groupsize, in_features, out_features)
```
Inputs required include:
The input tensor, packed_weights , groupsize, and the in_features and out_features.

API Usage: https://github.com/pytorch/pytorch/issues/143289

Model Perf :
7B Transformer model:
Prefill : 340 t/s
Decode  : 40  t/s
2B Transformer model
Prefill : 747 t/s
Decode  : 80  t/s

Tests:
python test/test_linalg.py -k test__dyn_quant_pack_4bit_weight
Ran 1 test in 0.016s

OK

python test/test_linalg.py -k test__dyn_quant_matmul_4bit
Ran 8 tests in 0.077s

OK

python test/test_linalg.py -k test_compile_dyn_quant_matmul_4bit
Ran 8 tests in 11.454s

Change-Id: Ia1672bad5e6ec94e64d8bb1971395d60f4b3a452

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134124
Approved by: https://github.com/digantdesai, https://github.com/malfet
2024-12-18 22:30:07 +00:00
a02e88d19c [miniz] Bump miniz version to 3.0.2 and add patch for zip64 (#140041)
Summary:
Bump miniz version from 2.1.0 to 3.0.2 and apply these patches:

* #79636 patches internal BUCK and bazel build
* #138959 adds `bool compute_crc32` argument
* miniz PR: https://github.com/richgel999/miniz/pull/324 to support
  zip64

Anyone bumping miniz version again, please apply these patches as well.

Test Plan:
Rely on unit test

Imported from OSS

Differential Revision: D65586230

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140041
Approved by: https://github.com/mikaylagawarecki
2024-11-09 00:13:16 +00:00
370d66d7dd aten/buck | Appropriately convert clang => msvc compiler_flags. (#137944)
Summary:
fPIC is not available in clang on Windows - filter it out.
Also configure the flags appropriately for MSVC.

Reviewed By: rameshviswanathan

Differential Revision: D64365660

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137944
Approved by: https://github.com/mwdavis84, https://github.com/ChristianK275, https://github.com/boguscoder
2024-10-15 20:21:01 +00:00
fa1d7b0262 Revert "Remove unused Caffe2 macros (#132979)"
This reverts commit da65cfbdea4f1f2176f6242004bda940a24f9ddb.

Reverted https://github.com/pytorch/pytorch/pull/132979 on behalf of https://github.com/ezyang due to these are apparently load bearing internally ([comment](https://github.com/pytorch/pytorch/pull/132979#issuecomment-2284666332))
2024-08-12 18:34:56 +00:00
cyy
da65cfbdea Remove unused Caffe2 macros (#132979)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132979
Approved by: https://github.com/ezyang
2024-08-09 04:48:20 +00:00
25d8a0480b [lint] Remove unnecessary BUCKRESTRICTEDSYNTAX suppressions
Differential Revision: D59935630

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131187
2024-07-19 07:19:11 -07:00
a21d4363d2 [Profiler] Remove all instances of TMP_USE_TSC_AS_TIMESTAMP (#129973)
Summary: Now that D56584521 is in, we can remove all insteances of TMP_USE_TSC_AS_TIMESTAMP

Test Plan:
Ran resnet. Trace looks good
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Jun_27_14_46_01.1967733.pt.trace.json.gz&bucket=gpu_traces

Reviewed By: aaronenyeshi, swolchok

Differential Revision: D59132793

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129973
Approved by: https://github.com/aaronenyeshi
2024-07-03 19:28:52 +00:00
1d0efedc85 [Profiler] Add TSC Clock Callback to CUPTI (#125036)
Summary:
Right now we use the default clock for CUPTI which is not monotonic nor particularly fast. We have already added the Kineto side of the implementation here: https://www.internalfb.com/diff/D56525885

This diff only adds the compile flags such that the TSC format is used and sets the converter using a libkineto call in the profiler

Test Plan:
Obtained following trace using resnet test:
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Apr_25_11_03_18.3862943.pt.trace.json.gz&bucket=gpu_traces

TBD: Add benchmarks

Differential Revision: D56584521

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125036
Approved by: https://github.com/aaronenyeshi
2024-06-27 21:07:43 +00:00
ff8042bcfb Enable AOTI shim v2 build and add into libtorch (#125211)
Summary:
Follow up of https://github.com/pytorch/pytorch/pull/125087

This diff will create shim v2 header and cpp file and corresponding build

Differential Revision: D56617546

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125211
Approved by: https://github.com/desertfire
2024-05-31 23:56:11 +00:00
cyy
d44daebdbc [Submodule] Remove deprecated USE_TBB option and TBB submodule (#127051)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127051
Approved by: https://github.com/cpuhrsch, https://github.com/malfet
2024-05-31 01:20:45 +00:00
67739d8c6f Revert "[Submodule] Remove deprecated USE_TBB option and TBB submodule (#127051)"
This reverts commit 699db7988d84d163ebb6919f78885e4630182a7a.

Reverted https://github.com/pytorch/pytorch/pull/127051 on behalf of https://github.com/PaliC due to This PR needs to be synced using the import button as there is a bug in our diff train ([comment](https://github.com/pytorch/pytorch/pull/127051#issuecomment-2138496995))
2024-05-30 01:16:57 +00:00
cyy
699db7988d [Submodule] Remove deprecated USE_TBB option and TBB submodule (#127051)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127051
Approved by: https://github.com/cpuhrsch, https://github.com/malfet
2024-05-29 11:58:03 +00:00
cdbb2c9acc Revert "[Submodule] Remove deprecated USE_TBB option and TBB submodule (#127051)"
This reverts commit 4fdbaa794f9d5af2f171f772a51cb710c51c925f.

Reverted https://github.com/pytorch/pytorch/pull/127051 on behalf of https://github.com/PaliC due to This PR needs to be synced using the import button as there is a bug in our diff train ([comment](https://github.com/pytorch/pytorch/pull/127051#issuecomment-2136428735))
2024-05-29 03:02:35 +00:00
cyy
4fdbaa794f [Submodule] Remove deprecated USE_TBB option and TBB submodule (#127051)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127051
Approved by: https://github.com/cpuhrsch, https://github.com/malfet
2024-05-27 03:54:03 +00:00
b9e7b35912 Remove caffe2 from more build files (#125898)
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125898
Approved by: https://github.com/Skylion007
2024-05-13 18:37:59 +00:00
63d4dc5a80 Remove TMP_LIBKINETO_NANOSECOND flag from Compilation (#124734)
Summary: Now that we have reached nanosecond granularity, we can now remove the temporary guards that were previously required for nanosecond precision.

Test Plan: Regression should cover this change

Reviewed By: aaronenyeshi

Differential Revision: D56444570

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124734
Approved by: https://github.com/aaronenyeshi
2024-04-26 06:57:03 +00:00
3ebbeb75fd [Profiler] Make Kineto traces export ns granularity for finer timestamps (#122425) (#123650)
Summary:

Kineto traces use microsecond level granularity because of chrome tracing defaults to that precision. Fix by adding preprocessor flag to TARGETS and BUCK files. Also remove any unnecessary ns to us conversions made in the profiler itself.

This diff contains profiler changes only. Libkineto changes found in D54964435.

Test Plan:
Check JSON and chrome tracing to make sure values are as expected. Tracing with flags enabled should have ns precision. Tracings without flags should be same as master.
Zoomer: https://www.internalfb.com/intern/zoomer/?profiling_run_fbid=796886748550189
Ran key_averages() to make sure FunctionEvent code working as expected:
--  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls

                                          ProfilerStep*         0.74%       3.976ms        64.40%     346.613ms      69.323ms       0.000us         0.00%      61.710ms      12.342ms             5
                      Optimizer.zero_grad#SGD.zero_grad         0.76%       4.109ms         0.76%       4.109ms     821.743us       0.000us         0.00%       0.000us       0.000us             5
                                          ## forward ##         6.89%      37.057ms        27.19%     146.320ms      29.264ms       0.000us         0.00%      58.708ms      11.742ms             5
                                           aten::conv2d         0.22%       1.176ms         7.74%      41.658ms     157.199us       0.000us         0.00%      27.550ms     103.962us           265
                                      aten::convolution         0.79%       4.273ms         7.52%      40.482ms     152.762us       0.000us         0.00%      27.550ms     103.962us           265
                                     aten::_convolution         0.69%       3.688ms         6.73%      36.209ms     136.637us       0.000us         0.00%      27.550ms     103.962us           265
                                aten::cudnn_convolution         6.04%      32.520ms         6.04%      32.520ms     122.719us      27.550ms         8.44%      27.550ms     103.962us           265
                                             aten::add_         2.42%      13.045ms         2.42%      13.045ms      30.694us      12.700ms         3.89%      12.700ms      29.882us           425
                                       aten::batch_norm         0.19%       1.027ms         8.12%      43.717ms     164.971us       0.000us         0.00%      16.744ms      63.185us           265
                           aten::_batch_norm_impl_index         0.31%       1.646ms         7.93%      42.691ms     161.096us       0.000us         0.00%      16.744ms      63.185us           265
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------

Differential Revision: D55925068

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123650
Approved by: https://github.com/aaronenyeshi
2024-04-11 04:29:20 +00:00
c66d503194 Revert "[Profiler][submodule] Make Kineto traces export ns granularity for finer timestamps (#122425)"
This reverts commit 6f7dd2f84a4237b31eac29054b86a5284ef6cb6b.

Reverted https://github.com/pytorch/pytorch/pull/122425 on behalf of https://github.com/malfet due to Breaks ROCM builds ([comment](https://github.com/pytorch/pytorch/pull/122425#issuecomment-2041129241))
2024-04-06 16:19:00 +00:00
6f7dd2f84a [Profiler][submodule] Make Kineto traces export ns granularity for finer timestamps (#122425)
Summary:
Kineto traces use microsecond level granularity because of chrome tracing defaults to that precision. Fix by adding preprocessor flag to TARGETS and BUCK files. Also remove any unnecessary ns to us conversions made in the profiler itself.

This diff contains profiler changes only. Libkineto changes found in D54964435.

Test Plan:
Check JSON and chrome tracing to make sure values are as expected. Tracing with flags enabled should have ns precision. Tracings without flags should be same as master.
Tracing with flags enabled: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Mar_18_14_37_22.4155151.pt.trace.json.gz&bucket=gpu_traces
Tracing without flags enabled: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Mar_18_14_39_15.4166047.pt.trace.json.gz&bucket=gpu_traces
Tracing on main: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Mar_18_14_42_43.4177559.pt.trace.json.gz&bucket=gpu_traces

Ran key_averages() to make sure FunctionEvent code working as expected:
--  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls

                                          ProfilerStep*         0.74%       3.976ms        64.40%     346.613ms      69.323ms       0.000us         0.00%      61.710ms      12.342ms             5
                      Optimizer.zero_grad#SGD.zero_grad         0.76%       4.109ms         0.76%       4.109ms     821.743us       0.000us         0.00%       0.000us       0.000us             5
                                          ## forward ##         6.89%      37.057ms        27.19%     146.320ms      29.264ms       0.000us         0.00%      58.708ms      11.742ms             5
                                           aten::conv2d         0.22%       1.176ms         7.74%      41.658ms     157.199us       0.000us         0.00%      27.550ms     103.962us           265
                                      aten::convolution         0.79%       4.273ms         7.52%      40.482ms     152.762us       0.000us         0.00%      27.550ms     103.962us           265
                                     aten::_convolution         0.69%       3.688ms         6.73%      36.209ms     136.637us       0.000us         0.00%      27.550ms     103.962us           265
                                aten::cudnn_convolution         6.04%      32.520ms         6.04%      32.520ms     122.719us      27.550ms         8.44%      27.550ms     103.962us           265
                                             aten::add_         2.42%      13.045ms         2.42%      13.045ms      30.694us      12.700ms         3.89%      12.700ms      29.882us           425
                                       aten::batch_norm         0.19%       1.027ms         8.12%      43.717ms     164.971us       0.000us         0.00%      16.744ms      63.185us           265
                           aten::_batch_norm_impl_index         0.31%       1.646ms         7.93%      42.691ms     161.096us       0.000us         0.00%      16.744ms      63.185us           265
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------

Differential Revision: D55087993

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122425
Approved by: https://github.com/aaronenyeshi
2024-04-06 06:04:28 +00:00
ffe45a8188 [ATen-vulkan] Implement global shader registry (#121088)
Differential Revision: D54447700

## Context

This changeset updates Vulkan SPIR-V codegen to introduce a global SPIR-V shader registry and register shaders dynamically at static initialization time. This change makes it possible to define and link custom shader libraries to the ATen-Vulkan runtime.

Before:

* `gen_vulkan_spv.py` generated two files, `spv.h` and `spv.cpp` which would contain the definition and initialization of Vulkan shader registry variables.

After:

* Introduce the `ShaderRegistry` class in `api/`, which encapsulates functionality of the `ShaderRegistry` class previously defined in the generated `spv.h` file
* Introduce a global shader registry (defined as a static variable in the `api::shader_registry() function`
* Define a `ShaderRegisterInit` class (taking inspiration from `TorchLibraryInit`) that allows for dynamic shader registration
* `gen_vulkan_spv.py` now only generates `spv.cpp`, which defines a static `ShaderRegisterInit` instance that triggers registration of the compiled shaders to the global shader registry.

Benefits:

* Cleaner code base; we no longer have `ShaderRegistry` defined in a generated file, and don't need a separate implementation file (`impl/Registry.*`) to handle shader lookup. All that logic now lives under `api/ShaderRegistry.*`
* Makes it possible to compile and link separate shader libraries, providing similar flexibility as defining and linking custom ATen operators

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121088
Approved by: https://github.com/manuelcandales, https://github.com/jorgep31415
2024-03-05 03:56:57 +00:00
9ec8dd2467 Reify view_func() closures as ViewFuncs (#118404)
Replaces `view_func()` closures with a reified `ViewFunc` data structure. Codegen generates a `ViewFunc` subclass for each view op (e.g. `NarrowViewFunc`) containing state needed to reconstruct the view. The `ViewFunc` API allows for querying and hot-swapping any `SymInt`s or `Tensors` in the state through `get_symints()` / `get_tensors()` / `clone_and_set()`, which will be essential for fake-ification later on.

```cpp
/// Base class for view functions, providing reapplication of a view on a new base.
/// Each view op should get a codegenerated subclass of this class containing
/// any state needed to reconstruct the view. The class also provides convenience
/// accessors for saved SymInts / tensor state. This is useful for e.g. fake-ification,
/// where we want to use symbolic values or fake tensors instead.
struct TORCH_API ViewFunc {
  virtual ~ViewFunc() {}
  /// Returns any SymInts in the saved state.
  virtual std::vector<c10::SymInt> get_symints() const { return {}; }
  /// Returns the number of SymInts in the saved state.
  virtual size_t num_symints() const { return 0; }
  /// Returns any tensors in the saved state.
  virtual std::vector<at::Tensor> get_tensors() const { return {}; }
  /// Returns the number of tensors in the saved state.
  virtual size_t num_tensors() const { return 0; }
  /// Reapplies the view on the given base using the saved state.
  virtual at::Tensor operator()(const at::Tensor&) const = 0;
  /// Returns a clone of this ViewFunc, optionally with the specified saved state.
  virtual std::unique_ptr<ViewFunc> clone_and_set(
      std::optional<std::vector<c10::SymInt>> = c10::nullopt,
      std::optional<std::vector<at::Tensor>> = c10::nullopt) const = 0;

protected:
  /// Sets the values of any SymInts in the saved state. The input vector size must
  /// match the number of SymInts in the saved state (i.e. the size of the list
  /// returned by get_symints()).
  virtual void set_symints(std::vector<c10::SymInt>) {}
  /// Sets the values of any Tensors in the saved state. The input vector size must
  /// match the number of Tensors in the saved state (i.e. the size of the list
  /// returned by get_tensors()).
  virtual void set_tensors(std::vector<at::Tensor>) {}
};
```

New codegen files:
* `torch/csrc/autograd/generated/ViewFunc.h`
* `torch/csrc/autograd/generated/ViewFuncs.cpp`

The templates for these also contains impls for `ChainedViewFunc` and `ErroringViewFunc` which are used in a few places within autograd.

Example codegen for `slice.Tensor`:
```cpp
// torch/csrc/autograd/generated/ViewFuncs.h
#define SLICE_TENSOR_VIEW_FUNC_AVAILABLE
struct SliceTensorViewFunc : public torch::autograd::ViewFunc {
  SliceTensorViewFunc(int64_t dim, c10::optional<c10::SymInt> start, c10::optional<c10::SymInt> end, c10::SymInt step) : dim(dim), start(start), end(end), step(step)
  {};
  virtual ~SliceTensorViewFunc() override {};
  virtual std::vector<c10::SymInt> get_symints() const override;
  virtual size_t num_symints() const override;
  virtual std::vector<at::Tensor> get_tensors() const override;
  virtual size_t num_tensors() const override;
  virtual at::Tensor operator()(const at::Tensor&) const override;
  virtual std::unique_ptr<ViewFunc> clone_and_set(
      std::optional<std::vector<c10::SymInt>> = c10::nullopt,
      std::optional<std::vector<at::Tensor>> = c10::nullopt) const override;

protected:
  virtual void set_symints(std::vector<c10::SymInt>) override;
  virtual void set_tensors(std::vector<at::Tensor>) override;

private:
  int64_t dim;
  c10::optional<c10::SymInt> start;
  c10::optional<c10::SymInt> end;
  c10::SymInt step;
};
...

// torch/csrc/autograd/generated/ViewFuncs.cpp
std::vector<c10::SymInt> SliceTensorViewFunc::get_symints() const {
  ::std::vector<c10::SymInt> symints;
  symints.reserve((start.has_value() ? 1 : 0) + (end.has_value() ? 1 : 0) + 1);
  if(start.has_value()) symints.insert(symints.end(), *(start));
  if(end.has_value()) symints.insert(symints.end(), *(end));
  symints.push_back(step);
  return symints;
}

size_t SliceTensorViewFunc::num_symints() const {
  return static_cast<size_t>((start.has_value() ? 1 : 0) + (end.has_value() ? 1 : 0) + 1);
}

void SliceTensorViewFunc::set_symints(std::vector<c10::SymInt> symints) {
  TORCH_INTERNAL_ASSERT(symints.size() == num_symints());
  auto i = 0;
  if(start.has_value()) start = symints[i];
  i += (start.has_value() ? 1 : 0);
  if(end.has_value()) end = symints[i];
  i += (end.has_value() ? 1 : 0);
  step = symints[i];
}

std::vector<at::Tensor> SliceTensorViewFunc::get_tensors() const {
  ::std::vector<at::Tensor> tensors;
  return tensors;
}

size_t SliceTensorViewFunc::num_tensors() const {
  return static_cast<size_t>(0);
}

void SliceTensorViewFunc::set_tensors(std::vector<at::Tensor> tensors) {
  TORCH_INTERNAL_ASSERT(tensors.size() == num_tensors());

}

at::Tensor SliceTensorViewFunc::operator()(const at::Tensor& input_base) const {
  return at::_ops::slice_Tensor::call(input_base, dim, start, end, step);
}

std::unique_ptr<ViewFunc> SliceTensorViewFunc::clone_and_set(
    std::optional<std::vector<c10::SymInt>> symints,
    std::optional<std::vector<at::Tensor>> tensors) const {
  auto output = std::make_unique<SliceTensorViewFunc>(dim, start, end, step);
  if (symints.has_value()) {
    output->set_symints(std::move(*(symints)));
  }
  if (tensors.has_value()) {
    output->set_tensors(std::move(*(tensors)));
  }
  return output;
}
```

The `_view_func()` / `_view_func_unsafe()` methods now accept two additional (optional) args for `symint_visitor_fn` / `tensor_visitor_fn`. If these are defined, they are expected to be python callables that operate on a single SymInt / tensor and return a new one. This allows for the hot-swapping needed during fake-ification.

For testing, there are extensive pre-existing tests, and I added a test to ensure that hot-swapping functions correctly.
```sh
python test/test_autograd.py -k test_view_func_replay
python test/test_ops.py -k test_view_replay
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118404
Approved by: https://github.com/ezyang
2024-02-14 22:00:43 +00:00
24bdd03d23 Revert "Reify view_func() closures as ViewFuncs (#118404)"
This reverts commit d5a6762263a98e5153bc057c8ba4f377542c7e55.

Reverted https://github.com/pytorch/pytorch/pull/118404 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/118404#issuecomment-1938600260))
2024-02-12 12:38:51 +00:00
d5a6762263 Reify view_func() closures as ViewFuncs (#118404)
Replaces `view_func()` closures with a reified `ViewFunc` data structure. Codegen generates a `ViewFunc` subclass for each view op (e.g. `NarrowViewFunc`) containing state needed to reconstruct the view. The `ViewFunc` API allows for querying and hot-swapping any `SymInt`s or `Tensors` in the state through `get_symints()` / `get_tensors()` / `clone_and_set()`, which will be essential for fake-ification later on.

```cpp
/// Base class for view functions, providing reapplication of a view on a new base.
/// Each view op should get a codegenerated subclass of this class containing
/// any state needed to reconstruct the view. The class also provides convenience
/// accessors for saved SymInts / tensor state. This is useful for e.g. fake-ification,
/// where we want to use symbolic values or fake tensors instead.
struct TORCH_API ViewFunc {
  virtual ~ViewFunc() {}
  /// Returns any SymInts in the saved state.
  virtual std::vector<c10::SymInt> get_symints() const { return {}; }
  /// Returns the number of SymInts in the saved state.
  virtual size_t num_symints() const { return 0; }
  /// Returns any tensors in the saved state.
  virtual std::vector<at::Tensor> get_tensors() const { return {}; }
  /// Returns the number of tensors in the saved state.
  virtual size_t num_tensors() const { return 0; }
  /// Reapplies the view on the given base using the saved state.
  virtual at::Tensor operator()(const at::Tensor&) const = 0;
  /// Returns a clone of this ViewFunc, optionally with the specified saved state.
  virtual std::unique_ptr<ViewFunc> clone_and_set(
      std::optional<std::vector<c10::SymInt>> = c10::nullopt,
      std::optional<std::vector<at::Tensor>> = c10::nullopt) const = 0;

protected:
  /// Sets the values of any SymInts in the saved state. The input vector size must
  /// match the number of SymInts in the saved state (i.e. the size of the list
  /// returned by get_symints()).
  virtual void set_symints(std::vector<c10::SymInt>) {}
  /// Sets the values of any Tensors in the saved state. The input vector size must
  /// match the number of Tensors in the saved state (i.e. the size of the list
  /// returned by get_tensors()).
  virtual void set_tensors(std::vector<at::Tensor>) {}
};
```

New codegen files:
* `torch/csrc/autograd/generated/ViewFunc.h`
* `torch/csrc/autograd/generated/ViewFuncs.cpp`

The templates for these also contains impls for `ChainedViewFunc` and `ErroringViewFunc` which are used in a few places within autograd.

Example codegen for `slice.Tensor`:
```cpp
// torch/csrc/autograd/generated/ViewFuncs.h
#define SLICE_TENSOR_VIEW_FUNC_AVAILABLE
struct SliceTensorViewFunc : public torch::autograd::ViewFunc {
  SliceTensorViewFunc(int64_t dim, c10::optional<c10::SymInt> start, c10::optional<c10::SymInt> end, c10::SymInt step) : dim(dim), start(start), end(end), step(step)
  {};
  virtual ~SliceTensorViewFunc() override {};
  virtual std::vector<c10::SymInt> get_symints() const override;
  virtual size_t num_symints() const override;
  virtual std::vector<at::Tensor> get_tensors() const override;
  virtual size_t num_tensors() const override;
  virtual at::Tensor operator()(const at::Tensor&) const override;
  virtual std::unique_ptr<ViewFunc> clone_and_set(
      std::optional<std::vector<c10::SymInt>> = c10::nullopt,
      std::optional<std::vector<at::Tensor>> = c10::nullopt) const override;

protected:
  virtual void set_symints(std::vector<c10::SymInt>) override;
  virtual void set_tensors(std::vector<at::Tensor>) override;

private:
  int64_t dim;
  c10::optional<c10::SymInt> start;
  c10::optional<c10::SymInt> end;
  c10::SymInt step;
};
...

// torch/csrc/autograd/generated/ViewFuncs.cpp
std::vector<c10::SymInt> SliceTensorViewFunc::get_symints() const {
  ::std::vector<c10::SymInt> symints;
  symints.reserve((start.has_value() ? 1 : 0) + (end.has_value() ? 1 : 0) + 1);
  if(start.has_value()) symints.insert(symints.end(), *(start));
  if(end.has_value()) symints.insert(symints.end(), *(end));
  symints.push_back(step);
  return symints;
}

size_t SliceTensorViewFunc::num_symints() const {
  return static_cast<size_t>((start.has_value() ? 1 : 0) + (end.has_value() ? 1 : 0) + 1);
}

void SliceTensorViewFunc::set_symints(std::vector<c10::SymInt> symints) {
  TORCH_INTERNAL_ASSERT(symints.size() == num_symints());
  auto i = 0;
  if(start.has_value()) start = symints[i];
  i += (start.has_value() ? 1 : 0);
  if(end.has_value()) end = symints[i];
  i += (end.has_value() ? 1 : 0);
  step = symints[i];
}

std::vector<at::Tensor> SliceTensorViewFunc::get_tensors() const {
  ::std::vector<at::Tensor> tensors;
  return tensors;
}

size_t SliceTensorViewFunc::num_tensors() const {
  return static_cast<size_t>(0);
}

void SliceTensorViewFunc::set_tensors(std::vector<at::Tensor> tensors) {
  TORCH_INTERNAL_ASSERT(tensors.size() == num_tensors());

}

at::Tensor SliceTensorViewFunc::operator()(const at::Tensor& input_base) const {
  return at::_ops::slice_Tensor::call(input_base, dim, start, end, step);
}

std::unique_ptr<ViewFunc> SliceTensorViewFunc::clone_and_set(
    std::optional<std::vector<c10::SymInt>> symints,
    std::optional<std::vector<at::Tensor>> tensors) const {
  auto output = std::make_unique<SliceTensorViewFunc>(dim, start, end, step);
  if (symints.has_value()) {
    output->set_symints(std::move(*(symints)));
  }
  if (tensors.has_value()) {
    output->set_tensors(std::move(*(tensors)));
  }
  return output;
}
```

The `_view_func()` / `_view_func_unsafe()` methods now accept two additional (optional) args for `symint_visitor_fn` / `tensor_visitor_fn`. If these are defined, they are expected to be python callables that operate on a single SymInt / tensor and return a new one. This allows for the hot-swapping needed during fake-ification.

For testing, there are extensive pre-existing tests, and I added a test to ensure that hot-swapping functions correctly.
```sh
python test/test_autograd.py -k test_view_func_replay
python test/test_ops.py -k test_view_replay
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118404
Approved by: https://github.com/ezyang
2024-02-09 18:51:36 +00:00
545d2126f6 [pt-vulkan] Enable Python code blocks in shader templates and upgrade shader template generation (#115948)
Summary:
This change makes two major improvements to PyTorch Vulkan's shader authoring workflow.

## Review Guide

There are a lot of changed files because every GLSL shader had to be touched. The majority of changes is changing

```
#define PRECISION $precision
#define FORMAT $format
```

to

```
#define PRECISION ${PRECISION}
#define FORMAT ${FORMAT}
```

due to changes in how shader templates are processed.

For reviewers, the primary functional changes to review are:

* `gen_vulkan_spv.py`
  * Majority of functional changes are in this file, which controls how shader templates are processed.
* `shader_params.yaml`
  * controls how shader variants are generated

## Python Codeblocks in Shader Templates

From now on, every compute shader (i.e. `.glsl`) is treated as a shader template. To this effect, the `templates/` folder has been removed and there is now a global `shader_params.yaml` file to describe the shader variants that should be generated for all shader templates.

**Taking inspiration from XNNPACK's [`xngen` tool](https://github.com/google/XNNPACK/blob/master/tools/xngen.py), shader templates can now use Python codeblocks**.  One example is:

```
$if not INPLACE:
  layout(set = 0, binding = 0, FORMAT) uniform PRECISION restrict writeonly image3D uOutput;
  layout(set = 0, binding = 1) uniform PRECISION sampler3D uInput;
  layout(set = 0, binding = 2) uniform PRECISION sampler3D uOther;
  layout(set = 0, binding = 3) uniform PRECISION restrict Block {
    ivec4 output_sizes;
    ivec4 input_sizes;
    ivec4 other_sizes;
    float alpha;
  }
  uArgs;
$else:
  layout(set = 0, binding = 0, FORMAT) uniform PRECISION restrict image3D uOutput;
  layout(set = 0, binding = 1) uniform PRECISION sampler3D uOther;
  layout(set = 0, binding = 2) uniform PRECISION restrict Block {
    ivec4 output_sizes;
    ivec4 other_sizes;
    float alpha;
  }
  uArgs;
```

Another is:

```
  // PYTHON CODEBLOCK
  $if not IS_DIV:
    const int c_index = (pos.z % ((uArgs.output_sizes.z + 3) / 4)) * 4;
    if (uArgs.other_sizes.z != 1 && c_index + 3 >= uArgs.output_sizes.z) {
      ivec4 c_ind = ivec4(c_index) + ivec4(0, 1, 2, 3);
      vec4 mask = vec4(lessThan(c_ind, ivec4(uArgs.output_sizes.z)));
      other_texel = other_texel * mask + vec4(1, 1, 1, 1) - mask;
    }

  // PYTHON CODEBLOCK
  $if not INPLACE:
    ivec3 input_pos =
        map_output_pos_to_input_pos(pos, uArgs.output_sizes, uArgs.input_sizes);
    const vec4 in_texel =
        load_texel(input_pos, uArgs.output_sizes, uArgs.input_sizes, uInput);

    imageStore(uOutput, pos, OP(in_texel, other_texel, uArgs.alpha));
  $else:
    const vec4 in_texel = imageLoad(uOutput, pos);
    imageStore(uOutput, pos, OP(in_texel, other_texel, uArgs.alpha));
```

In addition to making it easier and clearer to write shader templates, this enables shaders that were previously unable to be consolidated into a single template to now be represented using a single template, such as non inplace and inplace variants of the same shader.

## `generate_variant_forall` in shader variant YAML configuration

YAML files that describe how shader variants should be generated can now use a `generate_variant_forall` field to iterate over various settings for a specific parameter for each variant defined. Example:

```
unary_op:
  parameter_names_with_default_values:
    OPERATOR: exp(X)
    INPLACE: 0
  generate_variant_forall:
    INPLACE:
      - VALUE: 0
        SUFFIX: ""
      - VALUE: 1
        SUFFIX: "inplace"
  shader_variants:
    - NAME: exp
      OPERATOR: exp(X)
    - NAME: sqrt
      OPERATOR: sqrt(X)
    - NAME: log
      OPERATOR: log(X)
```

Previously, the `inplace` variants would need to have separate `shader_variants` entries. If there are multiple variables that need to be iterated across, then all possible combinations will be generated. Would be good to take a look to see how the new YAML configuration works.

Test Plan:
There is no functional change to this diff; we only need to make sure that the generated shaders are still correct. Therefore, we only need to run `vulkan_api_test`.

```
# On Mac Laptop
buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 -- --gtest_filter="*"
```

Reviewed By: digantdesai

Differential Revision: D52087084

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115948
Approved by: https://github.com/manuelcandales
2023-12-20 05:47:33 +00:00
21a1d31ed8 [caffe2] update Meta-internal googletest references (#115407)
Summary: Update test dependencies to point to the new internal googletest location.

Test Plan: CI

Differential Revision: D51951643

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115407
Approved by: https://github.com/cccclai
2023-12-10 20:37:13 +00:00
13410d0eda Moving target/code path to non-pytorch repo (#114095)
Differential Revision: D51460806

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114095
Approved by: https://github.com/digantdesai
2023-12-02 19:27:09 +00:00
f4c0ef95bc [AMD] Fix broken build from nested transformer utils (#110245)
Summary: D49374910 breaks internal amd build because we didn't hipify the header file in nested/cuda. Maybe it's just easier to move it outside.

Reviewed By: nmacchioni

Differential Revision: D49743234

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110245
Approved by: https://github.com/drisspg
2023-10-03 08:05:10 +00:00
8124a6c40c [TORCH_LIBRARY] Add impl_abstract_pystub (#109529)
We want users to be able to define custom ops in C++ but put the
abstract impl in Python (since it is easier to write them in Python and
the abstract impl better models device semantics and data-dependent
operators).

`m.impl_abstract_pystub(opname, python_module, context)` declares the
abstract_impl of the operator to exist in the given python module.
When the abstract_impl needs to be accessed (either via FakeTensor or
Meta), and it does not exist, the PyTorch Dispatcher will yell
with a descriptive error message.

Some details:
- We construct a new global AbstractImplPyStub mapping in
  Dispatcher.cpp. Read/write to this map is protected by the Dispatcher
  lock.
- We add a new Meta Tensor fallback kernel. The fallback errors out if there is
  no meta kernel, but also offers a nicer error message if we see that there is
  a pystub.
- We create a `torch._utils_internal.throw_abstract_impl_not_imported_error`
  helper function to throw errors. This way, we can throw different error
  messages in OSS PyTorch vs internal PyTorch. To invoke this from C++, we
  added a PyInterpreter::throw_abstract_impl_not_imported_error.

Differential Revision: [D49464753](https://our.internmc.facebook.com/intern/diff/D49464753/)

Differential Revision: [D49464753](https://our.internmc.facebook.com/intern/diff/D49464753)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109529
Approved by: https://github.com/ezyang, https://github.com/bdhirsh
2023-09-22 04:55:36 +00:00
cyy
ba0362a09e Remove unused build system checks and definitions (#109711)
Remove some outdated checks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109711
Approved by: https://github.com/ezyang
2023-09-21 16:52:16 +00:00
3add22b716 Created nested utils.cpp (#109304)
# Summary
This refactors the preprocessing for nestedtensors that glue into SDPA. This is done in order to aid with reviewing:
https://github.com/pytorch/pytorch/pull/97485

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109304
Approved by: https://github.com/cpuhrsch
2023-09-20 19:38:34 +00:00
15202cc80c [caffe2] Remove cxx override to c++17 (#108687)
Summary: Allow the user to specify the cxx version to use when compiling. For applications that compile with C++20, we wish to also compile this library with C++20 to avoid subtle ODR violations with using different library standards.

Test Plan: Built the project successfully.

Reviewed By: smeenai

Differential Revision: D48636406

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108687
Approved by: https://github.com/davidberard98
2023-09-12 02:54:59 +00:00
cyy
efc7c366f4 Remove auto_gil.h (#108492)
auto_gil.h has been deprecated for a long time. We can switch to pybind11.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108492
Approved by: https://github.com/Skylion007
2023-09-05 08:26:13 +00:00
7047d132fd add context support for custom device (#105056)
Fixes #ISSUE_NUMBER
as the title, add context support for custom device and testcase.
And in the future, we may want to refactor these hooks for different device to unify the APIs, would you agree my
idea? @albanD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105056
Approved by: https://github.com/albanD
2023-07-29 12:56:03 +00:00
b0a93c851c Fix BUCK build after #103185 (#103446)
Per title.

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at c64f0c0</samp>

> _`torch_headers` grows_
> _to include profiler files_
> _autumn of code change_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103446
Approved by: https://github.com/atalman, https://github.com/kit1980, https://github.com/malfet
2023-06-13 05:12:07 +00:00
a1d041728b Back out "[aarch64][tools/build_defs/third_party/fbcode_defs.bzl] Fix dep handling in cross-builds"
Differential Revision: D45415678nnPull Request resolved: https://github.com/pytorch/pytorch/pull/100294
2023-05-01 16:27:51 -07:00
0b1b063158 [buckbuild.bzl] Fix dep handling in cross-builds
Differential Revision: D44960349nnPull Request resolved: https://github.com/pytorch/pytorch/pull/99826
2023-04-25 20:53:28 -07:00
2630144786 Call to mkldnn_matmul from aten::addmm on AArch64 (#91763)
We have noticed that on BERT_pytorch in torchbenchmark majority of time is spent in running GEMM in aten:addmm. At the moment this calls into BLAS routine, but on AArch64 it will be faster if it calls into mkldnn_matmul. Performance wise compared to build with OpenBLAS it runs faster 1.2x faster on 16 cores with batch size of 8 on Graviton3, while if fast math mode (mkldnn_matmul exposes through oneDNN and Arm Compute Library option to run GEMM with FP32 inputs using BBF16 operations) is enabled then it is 2.3x

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91763
Approved by: https://github.com/jgong5, https://github.com/ngimel, https://github.com/malfet
2023-04-01 04:25:57 +00:00
b32afbbdb6 [Kineto] Improve Config Options Part 2 - update to new Kineto Submodule (#97556)
Summary: Remove the old client code, and replace with the new client interface after updating Kineto submodule.

Test Plan: CI Tests

Reviewed By: chaekit

Differential Revision: D44314909

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97556
Approved by: https://github.com/anupambhatnagar, https://github.com/davidberard98
2023-03-25 00:52:23 +00:00
197434df96 [Kineto] Improve Config Options for Input Shapes, Memory, Stack, Flops, and Modules - Part 1 (#97380)
Summary:
Improve On-Demand Kineto config to enable toggling of [profiler options](https://pytorch.org/docs/stable/profiler.html) via the config file. New config strings:
- PROFILE_REPORT_INPUT_SHAPES
- PROFILE_PROFILE_MEMORY
- PROFILE_WITH_STACK
- PROFILE_WITH_FLOPS
- PROFILE_WITH_MODULES

Also marked for deprecation, but still valid, old config options:
- CLIENT_INTERFACE_ENABLE_OP_INPUTS_COLLECTION
- PYTHON_STACK_TRACE

Test Plan: CI Tests (internal testing)

Reviewed By: leitian

Differential Revision: D44275220

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97380
Approved by: https://github.com/davidberard98
2023-03-24 19:12:19 +00:00
a1d7014c0f Hooking backward for QNNPACK (#94432)
Summary: Enabling quantized gradient.

Test Plan:
Algorithmic correctness - Dequantized matmul vs QNNPACK matmul for gradient - P616202766

```
dequantized matmul : [1.5463, -0.2917, -2.1735, 0.5689, -1.0795]
QNNPACK matmul : tensor([[ 1.5463, -0.2917, -2.1735,  0.5689, -1.0795]])
```

Differential Revision: D42593235

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94432
Approved by: https://github.com/malfet, https://github.com/kimishpatel
2023-03-08 10:21:32 +00:00
53b4f6c0f6 Revert "[jit] Add c++ stacktraces for jit::ErrorReport (#94842)" (#95886)
This reverts commit 70029214f300f611e7dd816b5f64426224f6ab96.

It broke some internal tests.

Differential Revision: [D43735833](https://our.internmc.facebook.com/intern/diff/D43735833)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95886
Approved by: https://github.com/malfet, https://github.com/qihqi
2023-03-03 05:49:40 +00:00
70029214f3 [jit] Add c++ stacktraces for jit::ErrorReport (#94842)
**Summary**: This PR adds C++ stacktraces to jit::ErrorReports. After this PR, if you run with `TORCH_SHOW_CPP_STACKTRACES=1` environment variable and a jit::ErrorReport is thrown, then the C++ stacktrace should be displayed.

**More background**: This behavior already occurs for c10::Error; but not for jit::ErrorReport. jit::ErrorReport _does_ usually have a python stacktrace for the python source, but it is sometimes still helpful to know where in the C++ codebase the error came from.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94842
Approved by: https://github.com/qihqi
2023-02-28 22:37:51 +00:00
f92348e13d Clean up mentions of removed torch/csrc/generic/*.cpp (#94107)
Summary: The dir was removed in https://github.com/pytorch/pytorch/pull/82373.

Test Plan: Sandcastlle + GitHub CI.

Differential Revision: D43016100

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94107
Approved by: https://github.com/malfet, https://github.com/huydhn, https://github.com/ZainRizvi
2023-02-07 07:17:16 +00:00
2064fa9f10 Clean-up removed TH from BUCK (#94022)
Differential Revision: D42981979

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94022
Approved by: https://github.com/huydhn, https://github.com/izaitsevfb, https://github.com/malfet
2023-02-04 08:16:43 +00:00
b847ac227f Fix typo in buckbuild.bzl (#92751)
accomodate -> accommodate

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92751
Approved by: https://github.com/Skylion007
2023-01-22 17:35:38 +00:00
cd62ad5f88 [Vulkan] Enable including GLSL files from custom locations in gen_vulkan_spv (#91913)
@bypass-github-export-checks

To include custom locations when building with buck, use a ```-c gen_vulkan_spv.additional_glsl_paths="..."``` flag where ... is a list of filegroups and source directory paths separated by spaces,

ex. to include the sources added in D41413913, you would use

```
buck build ... -c gen_vulkan_spv.additional_glsl_paths="//xplat/caffe2:test_glsl_src_path_a test_src/a //xplat/caffe2:test_glsl_src_path_b test_src/b"
```

(as shown in the test plan)

Differential Revision: [D41413914](https://our.internmc.facebook.com/intern/diff/D41413914/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D41413914/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91913
Approved by: https://github.com/mcr229
2023-01-10 20:35:23 +00:00
64a3738fcd [vulkan] Remove external dependencies in core API and introduce torch_vulkan_api target (#91021)
This diff isolates the core components of the Pytorch Vulkan backend into its own target (`//xplat/caffe2:torch_vulkan_api`). The main motivation for this is to create a library that does not have a dependency on the ATen library which can then be used to build a graph mode runtime for Vulkan for Executorch.

In addition to introducing the new target, this diff also removes some references to external dependencies in the `api/` folder so that files in that folder are completely self contained.

Differential Revision: [D42038817](https://our.internmc.facebook.com/intern/diff/D42038817/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D42038817/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91021
Approved by: https://github.com/kirklandsign
2023-01-05 00:41:23 +00:00
25eb7c3ae3 Clean up dependancy for flatbuffer_loader (#86041)
Test Plan: waitforsandcastle

Differential Revision: D38445936

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86041
Approved by: https://github.com/cccclai
2022-12-08 03:48:04 +00:00
0ad6715b7b [aarch64] add sleef_arm dependency (#89988)
Reviewed By: kimishpatel, psaab

Differential Revision: D41601965

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89988
Approved by: https://github.com/soumith
2022-12-01 21:10:53 +00:00
969a7d09f6 Revert "[aarch64] add SLEEF dependency for aten_cpu (#89475)"
This reverts commit 3cef87f9fd59adb681d910b8edbc1f33e0be5ad2.

Reverted https://github.com/pytorch/pytorch/pull/89475 on behalf of https://github.com/jeanschmidt due to breaking internal builds
2022-11-30 12:06:18 +00:00