# Motivation
According to [[RFC] Intel GPU Upstreaming](https://github.com/pytorch/pytorch/issues/114723), we would like to upstream the AMP autocast policy to support the functionality and accuracy of `torch.compile` on end-to-end benchmarks.
# Solution
The first PR makes the `KERNEL` macro generic so that it accepts two forms of arguments: `(DISPATCH, OP, POLICY)` and `(DISPATCH, OP, OVERLOAD, POLICY)`.
The second PR refactors CUDA's autocast policy so that it can be shared with the `XPU` backend.
The final PR adds the XPU autocast policy, which shares the same recipe as the `CUDA` backend.
# Additional Context
Another motivation is that we would like to unify the autocast API and provide generic APIs such as:
- `torch.get_autocast_dtype(device_type)`
- `torch.set_autocast_dtype(device_type)`
- `torch.is_autocast_enabled(device_type)`
- `torch.set_autocast_enabled(device_type)`
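A minimal usage sketch of these proposed entry points; the exact signatures (in particular the value arguments of the setters) are assumptions based on the list above, not a finalized API:
```python
import torch

# Hedged sketch: device-generic autocast state accessors. The setter
# signatures (device_type, value) are assumed, not taken from the source.
device_type = "xpu"  # or "cuda"

torch.set_autocast_enabled(device_type, True)
torch.set_autocast_dtype(device_type, torch.bfloat16)

print(torch.is_autocast_enabled(device_type))   # -> True
print(torch.get_autocast_dtype(device_type))    # -> torch.bfloat16
```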
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124050
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD
Removed a bunch of skips. I also updated `test_forloop_goes_right_direction` to *not* use the closure when dynamo is tracing; the reason for this is that testing the disabled optimizer doesn't actually test anything.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123322
Approved by: https://github.com/janeyx99
ghstack dependencies: #123498
Fix part of https://github.com/pytorch/pytorch/issues/123603.
Example traceback on branch https://github.com/pytorch/vision/compare/main...wwen/custom_ops_test:
```
running my_custom_op!
Traceback (most recent call last):
File "/data/users/williamwen/torchvision/playground.py", line 13, in <module>
print(opt_fn1(torch.randn(3, 3)))
File "/data/users/williamwen/pytorch2/torch/_dynamo/eval_frame.py", line 387, in _fn
return fn(*args, **kwargs)
File "/data/users/williamwen/pytorch2/torch/_dynamo/convert_frame.py", line 977, in catch_errors
return callback(frame, cache_entry, hooks, frame_state, skip=1)
File "/data/users/williamwen/pytorch2/torch/_dynamo/convert_frame.py", line 818, in _convert_frame
result = inner_convert(
File "/data/users/williamwen/pytorch2/torch/_dynamo/convert_frame.py", line 411, in _convert_frame_assert
return _compile(
File "/data/users/williamwen/pytorch2/torch/_utils_internal.py", line 70, in wrapper_function
return function(*args, **kwargs)
File "/data/users/williamwen/py310-env/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/data/users/williamwen/pytorch2/torch/_dynamo/convert_frame.py", line 700, in _compile
guarded_code = compile_inner(code, one_graph, hooks, transform)
File "/data/users/williamwen/pytorch2/torch/_dynamo/utils.py", line 266, in time_wrapper
r = func(*args, **kwargs)
File "/data/users/williamwen/pytorch2/torch/_dynamo/convert_frame.py", line 568, in compile_inner
out_code = transform_code_object(code, transform)
File "/data/users/williamwen/pytorch2/torch/_dynamo/bytecode_transformation.py", line 1116, in transform_code_object
transformations(instructions, code_options)
File "/data/users/williamwen/pytorch2/torch/_dynamo/convert_frame.py", line 173, in _fn
return fn(*args, **kwargs)
File "/data/users/williamwen/pytorch2/torch/_dynamo/convert_frame.py", line 515, in transform
tracer.run()
File "/data/users/williamwen/pytorch2/torch/_dynamo/symbolic_convert.py", line 2237, in run
super().run()
File "/data/users/williamwen/pytorch2/torch/_dynamo/symbolic_convert.py", line 875, in run
while self.step():
File "/data/users/williamwen/pytorch2/torch/_dynamo/symbolic_convert.py", line 790, in step
self.dispatch_table[inst.opcode](self, inst)
File "/data/users/williamwen/pytorch2/torch/_dynamo/symbolic_convert.py", line 492, in wrapper
return inner_fn(self, inst)
File "/data/users/williamwen/pytorch2/torch/_dynamo/symbolic_convert.py", line 1260, in CALL_FUNCTION
self.call_function(fn, args, {})
File "/data/users/williamwen/pytorch2/torch/_dynamo/symbolic_convert.py", line 730, in call_function
self.push(fn.call_function(self, args, kwargs))
File "/data/users/williamwen/pytorch2/torch/_dynamo/variables/torch.py", line 747, in call_function
tensor_variable = wrap_fx_proxy(
File "/data/users/williamwen/pytorch2/torch/_dynamo/variables/builder.py", line 1425, in wrap_fx_proxy
return wrap_fx_proxy_cls(target_cls=TensorVariable, **kwargs)
File "/data/users/williamwen/pytorch2/torch/_dynamo/variables/builder.py", line 1510, in wrap_fx_proxy_cls
example_value = get_fake_value(proxy.node, tx, allow_non_graph_fake=True)
File "/data/users/williamwen/pytorch2/torch/_dynamo/utils.py", line 1804, in get_fake_value
raise TorchRuntimeError(str(e)).with_traceback(e.__traceback__) from None
File "/data/users/williamwen/pytorch2/torch/_dynamo/utils.py", line 1736, in get_fake_value
ret_val = wrap_fake_exception(
File "/data/users/williamwen/pytorch2/torch/_dynamo/utils.py", line 1251, in wrap_fake_exception
return fn()
File "/data/users/williamwen/pytorch2/torch/_dynamo/utils.py", line 1737, in <lambda>
lambda: run_node(tx.output, node, args, kwargs, nnmodule)
File "/data/users/williamwen/pytorch2/torch/_dynamo/utils.py", line 1872, in run_node
raise RuntimeError(make_error_message(e)).with_traceback(
File "/data/users/williamwen/pytorch2/torch/_dynamo/utils.py", line 1854, in run_node
return node.target(*args, **kwargs)
File "/data/users/williamwen/pytorch2/torch/_ops.py", line 870, in __call__
return self_._op(*args, **(kwargs or {}))
torch._dynamo.exc.TorchRuntimeError: Failed running call_function torchvision.my_custom_op1(*(FakeTensor(..., size=(3, 3)),), **{}):
The tensor has a non-zero number of elements, but its data is not allocated yet. Caffe2 uses a lazy allocation, so you will need to call mutable_data() or raw_mutable_data() to actually allocate memory.
If you're using torch.compile/export/fx, it is likely that we are erroneously tracing into a custom kernel. To fix this, please wrap the custom kernel into an opaque custom op. Please see the following for details: https://docs.google.com/document/d/1W--T6wz8IY8fOI0Vm8BF44PdBgs283QvpelJZWieQWQ
from user code:
File "/data/users/williamwen/torchvision/playground.py", line 5, in fn1
return torch.ops.torchvision.my_custom_op1(x)
```
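As a hedged illustration of the fix the error message suggests, here is a sketch of wrapping such a kernel as an opaque custom op so dynamo does not trace into it. The namespace, op name, and implementations are hypothetical, and it assumes the `torch.library.custom_op` decorator from newer PyTorch releases:
```python
import torch
from torch import Tensor

# Hypothetical stand-in for a custom kernel; registering it via
# torch.library.custom_op makes it opaque to torch.compile tracing.
@torch.library.custom_op("mylib::my_custom_op1", mutates_args=())
def my_custom_op1(x: Tensor) -> Tensor:
    # A real implementation would call into the C++/CUDA kernel here.
    return x.clone()

# A fake (meta) implementation so the compiler can reason about shapes
# without running the real kernel.
@my_custom_op1.register_fake
def _(x: Tensor) -> Tensor:
    return torch.empty_like(x)

@torch.compile
def fn1(x):
    return torch.ops.mylib.my_custom_op1(x)

print(fn1(torch.randn(3, 3)))
```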
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124240
Approved by: https://github.com/zou3519
Motivations:
- This makes things more consistent: using a Library object, you should
be able to use all of the registration APIs that tie registrations to
the lifetime of the Library (see the sketch below).
- I need this for the next PR up in the stack, where we will have
torch.library.register_fake support both CustomOpDef (from the new
custom ops API) and other custom ops.
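A hedged sketch of what this enables; the namespace `mylib`, op `foo`, and the `lib=` plumbing are illustrative assumptions based on the current `torch.library` surface:
```python
import torch
from torch import Tensor

# Tie an op definition, a CPU impl, and a fake impl to one Library object,
# so all registrations go away when the Library does.
lib = torch.library.Library("mylib", "FRAGMENT")
lib.define("foo(Tensor x) -> Tensor")

def foo_cpu(x: Tensor) -> Tensor:
    return x + 1

lib.impl("foo", foo_cpu, "CPU")

def foo_fake(x: Tensor) -> Tensor:
    return torch.empty_like(x)

# Passing lib= scopes the fake-impl registration to the Library's lifetime.
torch.library.register_fake("mylib::foo", foo_fake, lib=lib)

print(torch.ops.mylib.foo(torch.randn(3)))
```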
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124065
Approved by: https://github.com/albanD
ghstack dependencies: #123937, #124064
Previously, if someone used `register_fake` to add a fake impl for an
operator defined in C++, we would require them to add a
`m.set_python_module(<module>)` call to C++. This was to avoid
situations where a user imported the C++ operator without importing the
fake impl.
This "breaks" open registration: there's no way to add a fake impl
outside of a repository that defines an operator, so we want to turn
this behavior off by default in open source.
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124064
Approved by: https://github.com/albanD
ghstack dependencies: #123937
# Motivation
This PR is a part of RFC #114848 and is a successor to #116249 and #116019. It depends on the oneDNN compilation in #116249, and some runtime support from #116019 is also needed.
ATen operators like `addmm` and `baddbmm` are defined in `Blas.cpp` under `aten/src/ATen/native/mkldnn/xpu/`.
Alongside these files that provide the core functionality, `BlasImpl.h`, `Utils.h`, and other files provide basic utilities for them. For instance, `Utils.h` provides common memory-descriptor query utilities for `Matmul.h`, and these utility functions will also be used by other primitives, such as `convolution`. `BlasImpl.h` is a header that provides helpers for shape-info processing in matmul-related operators. It helps not only basic GEMM operators like `addmm` and `baddbmm`, but also fusion operators used in `torch.compile`, such as `linear_pointwise` in #117824.
In the next stage, we will continue to complete the oneDNN support by enabling the `matmul fusion` and `convolution` related code.
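For reference, a minimal usage sketch of the GEMM-family operators this PR routes through oneDNN, assuming a PyTorch build with XPU support:
```python
import torch

device = "xpu"  # requires an XPU-enabled build and device

# addmm: out = M + a @ b
M = torch.randn(4, 6, device=device)
a = torch.randn(4, 5, device=device)
b = torch.randn(5, 6, device=device)
out = torch.addmm(M, a, b)

# baddbmm: batched version, out[i] = batch_M[i] + batch_a[i] @ batch_b[i]
batch_M = torch.randn(8, 4, 6, device=device)
batch_a = torch.randn(8, 4, 5, device=device)
batch_b = torch.randn(8, 5, 6, device=device)
batch_out = torch.baddbmm(batch_M, batch_a, batch_b)
```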
Co-authored-by: xiaolil1 <xiaoli.liu@intel.com>
Co-authored-by: lei,zhenyuan <zhenyuan.lei@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117202
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/malfet
ghstack dependencies: #117098, #117112
# Motivation
As proposed in https://github.com/pytorch/pytorch/issues/114848 and https://github.com/pytorch/pytorch/issues/114723, the oneDNN library is an important component of the Intel GPU software ecosystem.
This PR is based on #117098, which makes the oneDNN library ready for Intel GPU. It adds the integration code from ATen to oneDNN; the GEMM integration code is the core part. Alongside GEMM, more basic support such as runtime (device, stream) and primitive attributes is also included.
We put the oneDNN integration code in the directory `aten/src/ATen/native/mkldnn/xpu/detail` and add the namespace `at::native::xpu::onednn` for the oneDNN integration.
The code in this PR will be used in subsequent PRs, where ATen operators will call the functions in this integration code. We separate the PRs because the oneDNN integration is logically separable from the ATen operator implementation; this also eases the review burden by avoiding too much code in a single PR.
Co-authored-by: xiaolil1 <xiaoli.liu@intel.com>
Co-authored-by: lei,zhenyuan <zhenyuan.lei@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117112
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/albanD
TF32 causes issues with the tolerances here; we might also consider migrating some of the `with_tf32_off` tests in this file to `tf32_on_and_off` in case it would be useful to get signal for TF32.
CC @malfet @atalman
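If it helps, a minimal sketch of the global switches the TF32 test decorators toggle (standard PyTorch backend flags; disabling them makes matmuls and convolutions run in true fp32 so tight tolerances hold):
```python
import torch

# Disable TF32 so results match fp32 tolerances; the with_tf32_off /
# tf32_on_and_off test decorators flip these per-test.
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False
```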
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124104
Approved by: https://github.com/zou3519
The speedup comes from unrolling the middle loop by 16 elements and using NEON to decode packed int4 to float32.
Unrolling the entire `n` loop actually makes it a tad slower, probably because ARM has a smaller register file than x86.
Before/after performance running stories110M on an M2 Pro:

| eager (before) | eager (after) | compile (before) | compile (after) |
| --- | --- | --- | --- |
| 28 | 57 | 31 | 104 |
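A scalar Python reference of the decode step (illustration only; the actual kernel does this 16 elements at a time with NEON intrinsics, and the (scale, zero) dequantization scheme here is an assumption rather than the kernel's exact layout):
```python
import torch

def unpack_int4_to_float32(packed: torch.Tensor, scale: float, zero: float) -> torch.Tensor:
    # Each uint8 holds two 4-bit values: low nibble first, then high nibble.
    low = packed & 0x0F
    high = (packed >> 4) & 0x0F
    nibbles = torch.stack([low, high], dim=-1).flatten()
    return (nibbles.to(torch.float32) - zero) * scale

packed = torch.randint(0, 256, (8,), dtype=torch.uint8)
print(unpack_int4_to_float32(packed, scale=0.1, zero=8.0))
```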
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124257
Approved by: https://github.com/mikekgfb
A unit test within test_cutlass_backend.py can fail with CUDA illegal memory accesses because some CUTLASS kernels contain bugs.
With autotuning in subprocesses, the CUDA illegal memory access simply leads to the buggy CUTLASS kernels being filtered out, instead of bringing down the entire process.
Test Plan:
This is a change to a unit test. It's recommended to use `autotune_in_subproc` when using the CUTLASS backend anyway.
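A hedged sketch of what that looks like from user code; the flag names reflect the current `torch._inductor.config` surface and may change:
```python
import torch
import torch._inductor.config as inductor_config

# Autotune GEMM candidates (including CUTLASS ones) in subprocesses so a
# crashing kernel is filtered out instead of killing the main process.
inductor_config.max_autotune_gemm_backends = "CUTLASS,ATen"
inductor_config.autotune_in_subproc = True

@torch.compile(mode="max-autotune")
def mm(a, b):
    return a @ b
```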
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124106
Approved by: https://github.com/eellison
This PR:
- adds a new torch.library.register_fake and deprecates
torch.library.impl_abstract. The motivation is that there is a lot of
confusion around the naming, so we are aligning the naming with
the actual subsystem (FakeTensor).
- renames `m.impl_abstract_pystub("fbgemm_gpu.sparse_ops")` to
`m.has_python_registration("fbgemm_gpu.sparse_ops")`. No deprecation
here yet; I need to test how this works with static initialization.
- Renames a bunch of internals to match (e.g. abstractimplpystub ->
pystub)
I'm scared to rename the Python-side internal APIs (e.g.
torch._library.abstract_impl) because of torch.package concerns. I'll do
that in its own isolated PR next just in case it causes problems.
DEPRECATION NOTE: torch.library.impl_abstract was renamed to
torch.library.register_fake. Please use register_fake; we'll delete
impl_abstract in a future version of PyTorch.
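A short sketch of the new spelling (the operator `mylib::bar` is made up and assumed to be defined elsewhere, e.g. in C++ or via `torch.library.define`):
```python
import torch
from torch import Tensor

# New name: register_fake. Previously this was
# @torch.library.impl_abstract("mylib::bar").
@torch.library.register_fake("mylib::bar")
def _(x: Tensor) -> Tensor:
    return torch.empty_like(x)
```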
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123937
Approved by: https://github.com/albanD
We can't get information about `ami-id`, `instance-id`, `instance-type` for the ARC runners:
```
2024-04-16T11:10:17.0098276Z curl: (22) The requested URL returned error: 401
2024-04-16T11:10:17.0110775Z ami-id:
2024-04-16T11:10:17.0159131Z curl: (22) The requested URL returned error: 401
2024-04-16T11:10:17.0167378Z instance-id:
2024-04-16T11:10:17.0219464Z curl: (22) The requested URL returned error: 401
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124171
Approved by: https://github.com/malfet, https://github.com/ZainRizvi, https://github.com/zxiiro