Citing @malfet's [comment](https://github.com/pytorch/pytorch/pull/136343#pullrequestreview-2318792396) in https://github.com/pytorch/pytorch/pull/136343
> It would be great, if users do not have to modify their programs for every new backend, but rather use with torch.device('xpu'): and keep rest of the code unchanged.
This PR makes the backend specification ("nccl", "gloo") optional when the user provides a `device_id` to `init_process_group` (passing `device_id` has previously been supported for the purpose of eager init).
New user experience:
```python
device = torch.device(device_type, rank % device_count)
dist.init_process_group(device_id=device)
```
The `device = torch.device(...)` line is needed anyway, since the user would use it for tensor creation etc.
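For contrast, a hedged before/after sketch (assuming the usual rendezvous environment variables `MASTER_ADDR`, `MASTER_PORT`, `RANK`, and `WORLD_SIZE` are already set):
```python
import os

import torch
import torch.distributed as dist

rank = int(os.environ["RANK"])
device = torch.device("cuda", rank % torch.cuda.device_count())

# Before: the backend string had to be spelled out per device type.
# dist.init_process_group("nccl", device_id=device)

# After this PR: the backend is derived from the device.
dist.init_process_group(device_id=device)
```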
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140963
Approved by: https://github.com/wconstab
Per discussion with @malfet, only allow the weights_only unpickler to load NJT if `torch.nested` and `torch._dynamo` are imported
(this is slightly weird, as technically `torch.nested` is actually imported by default and `torch._dynamo.decorators._DimRange` is what actually needs to be imported)
We can't import this from `torch.nested`, as doing so would
- undo dynamo's lazy import
- cause a circular import
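A minimal sketch of the gating described above (not the PR's exact code, and the helper name is illustrative); it checks `sys.modules` instead of importing, to avoid the two problems listed:
```python
import sys

def _njt_loading_allowed() -> bool:
    # NJT globals are allowlisted only when the user has already imported the
    # relevant modules; we never import them from the unpickler itself.
    return "torch.nested" in sys.modules and "torch._dynamo" in sys.modules
```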
===========================
Redo of https://github.com/pytorch/pytorch/pull/140304 caused issues as `torch.nested._internal.foo` needs to be imported, which causes issues like
```python
torch/_weights_only_unpickler.py", line 339, in load
if full_path in _get_allowed_globals():
torch/_weights_only_unpickler.py", line 188, in _get_allowed_globals
torch.nested._internal.nested_tensor.NestedTensor
AttributeError: module 'torch.nested' has no attribute '_internal'
```
**This likely wasn't caught in our CI because imports are global during unit tests(?), so we use a subprocess to properly test this time.**
Differential Revision: [D65961691](https://our.internmc.facebook.com/intern/diff/D65961691)
@jbschlosser
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140739
Approved by: https://github.com/malfet
Summary:
Fixes a bug in the quantizer where Conv + ReLU was fused even when the preceding conv has more than one user. Conv and ReLU cannot be fused in this case because the result of Conv must also be used elsewhere.
XNNPACK Delegate naturally handles this by inserting a clamp node for ReLU.
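An illustrative sketch of the fusion guard (not the quantizer's actual code), assuming an FX graph where the conv node's users are inspected before annotating the Conv + ReLU pattern:
```python
import torch.fx

def can_fuse_conv_relu(conv_node: torch.fx.Node, relu_node: torch.fx.Node) -> bool:
    # Fuse only if ReLU is the sole consumer of the conv output; otherwise the
    # unfused conv result is still needed elsewhere in the graph.
    return len(conv_node.users) == 1 and relu_node in conv_node.users
```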
Test Plan: CI
Reviewed By: digantdesai
Differential Revision: D65989599
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140846
Approved by: https://github.com/digantdesai
Doc updates:
* This adds documentation for the object-oriented ProcessGroup APIs that are being used in torchft as well as https://github.com/pytorch/rfcs/pull/71.
* It also does some general cleanups to simplify distributed.rst by using `:methods`.
* It adds `__init__` definitions for the Stores.
* I've reordered things so the collective APIs come before the Store/PG APIs.
Test plan:
```
lintrunner -a
cd docs && sphinx-autobuild source build/ -j auto -WT --keep-going
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140853
Approved by: https://github.com/kwen2501
# Overview
Currently monitor.py produces error-only results. This PR introduces a disable-monitor option to all *-test.yml files. We would also like to explore how the monitor code affects benchmark results.
# Next steps
- fix the monitor.py
- enable non-benchmark tests with monitor
- investigate benchmark test behavior with monitor background job
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140857
Approved by: https://github.com/huydhn
Tracking issue: #138399
This PR fixes a number of reference implementations (which are also used as meta
functions), making them more consistent with the CPU device. More specifically, it fixes those
operations that use the `_make_elementwise_unary_reference` decorator and don't error on a
mismatching `out` argument dtype, even though they do error when run on concrete devices (e.g. CPU); see the repro sketch after the list of fixed operations.
The fixed operations are:
- `abs`
- `ceil`
- `floor`
- `frac`
- `isneginf`
- `isposinf`
- `sgn`
- `sign`
- `signbit`
- `trunc`
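A hypothetical repro sketch of the inconsistency being fixed, using `abs` as one example:
```python
import torch

x = torch.randn(4)
out = torch.empty(4, dtype=torch.int64)
try:
    torch.abs(x, out=out)  # CPU errors: float result can't be cast to int64 out
except RuntimeError as e:
    print("cpu:", e)

x_meta = torch.randn(4, device="meta")
out_meta = torch.empty(4, dtype=torch.int64, device="meta")
try:
    torch.abs(x_meta, out=out_meta)  # previously accepted silently; now raises like CPU
except RuntimeError as e:
    print("meta:", e)
```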
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140288
Approved by: https://github.com/ezyang
ghstack dependencies: #140186, #140286
Fixes #139320
### Summary:
#### (1) Add `_rename_dynamic_shapes_with_model_inputs` so that dynamic_shapes plays along with input_names
* Use the model forward signature to rename dynamic_shapes when dynamic_shapes is not nested and directly uses the customized names. This solves the issue that torch.export.export expects dynamic_shapes to use only the model input names.
* If dynamic_shapes is nested, we do nothing.
#### (2) Add `_from_dynamic_shapes_to_dynamic_axes` for fallback
* We flatten dynamic_shapes with the leaves defined by `_pytree.tree_leaves()`
~~* If a dynamic_shapes is not nested, and defined in dict. We can use the key as the input_names, since it should be renamed by `_rename_dynamic_shapes_with_model_inputs` already.~~
* If dynamic_shapes is provided, input_names is required to assign the names, because dynamic_axes needs them.
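A hypothetical end-to-end sketch of the user-facing scenario this enables (assuming the dynamo-based exporter path of `torch.onnx.export`, which accepts `dynamic_shapes` together with `input_names`; names below are illustrative):
```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return x * 2.0

batch = torch.export.Dim("batch")

# dynamic_shapes is keyed by the customized input name "data" rather than the
# forward-signature name "x"; with this PR it is renamed for torch.export, and
# converted to dynamic_axes if the export falls back.
torch.onnx.export(
    M(),
    (torch.randn(2, 3),),
    "model.onnx",
    input_names=["data"],
    dynamic_shapes={"data": {0: batch}},
    dynamo=True,
)
```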
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139532
Approved by: https://github.com/justinchuby
I.e. replace `at::detail::getMPSHooks().isOnMacOSorNewer` with `is_macos_13_or_newer`, which is a direct function call rather than going through a virtual method call.
Hooks are only needed to provide a feature-agnostic interface for querying something even on platforms that might not support the feature, while functions implemented in `ATen/native/xxx` should be able to call those platform-specific methods directly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140950
Approved by: https://github.com/Skylion007
ghstack dependencies: #140896
Summary: When AOT_PARTITIONER_DEBUG is set to 1 and debug logging is turned on, we can now log the full input and output for each knapsack problem.
Differential Revision: D65633086
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140757
Approved by: https://github.com/jansel
Which is a variadic template that automates the tedious (and error-prone) process of passing arguments via a series of
```cpp
mtl_setBuffer(encoder, b1, 0);
mtl_setBuffer(encoder, b2, 1);
mtl_setBytes(encoder, param, 2);
```
into a compact
```cpp
mtl_setArgs(encoder, b1, b2, param);
```
Introduce a few more specializations of `mps_setArg`, such as:
- Call `setBuffer` for `id<MTLBuffer>`
- Copy double as float (as MPS does not support double-precision types)
- Accept `std::optional<at::Tensor>`, which will not call `setBuffer` if the optional is empty
Also, rework the metaprogramming of `mtl_setBytes` to make it usable with any trivially copyable struct, but keep a separate implementation for containers: uploading a `c10::SmallVector`, which is trivially copyable, would overwrite the next arguments, which luckily resulted in test failures of `test_cross_entropy_label_smoothing_weight_ignore_indices_mps`.
Introduce `has_size_type_v`, which can be used to differentiate between the trivially copyable `std::array` and `c10::ArrayRef` versus other trivially copyable structs.
```cpp
template <typename T>
class has_size_type {
template <typename U>
static constexpr std::true_type check(typename U::size_type*);
template <typename>
static constexpr std::false_type check(...);
public:
static constexpr bool value = decltype(check<T>(nullptr))::value;
};
template <typename T>
constexpr bool has_size_type_v = has_size_type<T>::value;
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140896
Approved by: https://github.com/Skylion007
Fixes #140598
Allows ragged structures for query and key+value sequence lengths to differ (i.e. supports cross attention for Flex + NJT).
Technically, this is BC-breaking thanks to arg renaming and positional arg reordering in `create_nested_block_mask()`, but Flex + NJT support isn't in a major release yet so I'm hoping we can just do it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140723
Approved by: https://github.com/drisspg
# Motivation
This PR is an extension of #131758. As described there, these changes aim to make distributed UTs more accessible to users of all device types.
It demonstrates a few of the changes discussed by @kwen2501 and @jgong5 in the discussion on #131758 (https://github.com/pytorch/pytorch/pull/131758#discussion_r1762422784).
This PR contains two types of changes. The first is to the common distributed folder, where we have added a new class derived from MultiProcessTestCase that abstracts out process group creation/deletion and other functionality for a given device.
New generalized content can be added by deriving from this base class.
It also includes other misc changes for Gaudi support.
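A hedged, illustrative sketch of such a base class (the class name and helper methods are hypothetical, not the PR's actual API):
```python
import torch.distributed as dist
from torch.testing._internal.common_distributed import MultiProcessTestCase

class DeviceAgnosticDistTestBase(MultiProcessTestCase):
    device_type = "cpu"   # overridden per platform, e.g. "cuda" or "hpu"
    backend = "gloo"      # e.g. "nccl" for CUDA, "hccl" for HPU

    def setUp(self):
        super().setUp()
        self._spawn_processes()

    def _init_pg(self):
        # Abstracts process group creation for the configured device/backend.
        dist.init_process_group(
            backend=self.backend,
            rank=self.rank,
            world_size=self.world_size,
            init_method=f"file://{self.file_name}",
        )

    def _destroy_pg(self):
        if dist.is_initialized():
            dist.destroy_process_group()
```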
The second changed file is test_functional_api.py, a test file in common distributed. This file is a POC for how we can use this new class to write more device-agnostic distributed test cases.
The following changes have been made to test_functional_api.py:
- Functionality has been added to test non-CUDA devices, using Intel HPU as an example
- Multiple setup steps previously required by MultiProcessTestCase have been abstracted out
- Misc adaptations to allow for general calls to accelerators, adding test skips instead of explicitly skipping for multiple GPUs
- Skip-if-HPU flags have been added to skip a few multithreaded test cases that are not yet supported on HPUs
NOTE: Within test_functional_api, there are tests which require some multithreading functions that are not yet supported on HPUs. These have been skipped for HPU using the skipHPU decorator.
I will be raising a separate PR to improve the usability of said decorators in a device-agnostic setting, in the manner suggested by @kwen2501 in a comment on this PR.
This PR is a cleaned-up version of a previous PR (#136988), which I closed due to human error. I have addressed some of the comments made by @kwen2501 in this PR as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138216
Approved by: https://github.com/kwen2501, https://github.com/guangyey
Here are some explanations of this PR.
1. Changes in `aten/src/ATen/core/Tensor.cpp` and `c10/core/DispatchKey.cpp`: support the toString method for the `QuantizedPrivateUse1` backend, making PyTorch print out the correct backend string for it.
2. Add the header `DispatchStub.h` in `aten/src/ATen/native/quantized/IndexKernel.h`: if this header is not included, we can't utilize `masked_fill_kernel_quantized_stub` even if we include the `IndexKernel.h` header; it would throw an error during compilation.
3. Add multiple `TORCH_API`s in `aten/src/ATen/native/quantized/AffineQuantizer.h`: these functions are useful for other PrivateUse1 backends supporting quantization functions; if these `TORCH_API`s are missing, it would throw an undefined-symbol error at runtime.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139860
Approved by: https://github.com/bdhirsh
Faced with an annoying string of warnings like this when running tests,
<img width="1644" alt="Screenshot 2024-11-15 at 11 23 21 AM" src="https://github.com/user-attachments/assets/91ff4e1d-3c29-4510-9a61-46e7df68a212">
My choices seem to be (1) call destroy_process_group() at the end of
each test fn, (2) do this in some wrapper, (3) do it in the base test
class.
Since tests in MultiProcessTestCase are responsible for calling
init_process_group themselves, they should also be responsible for
calling destroy; put differently, method (3) alone would be asymmetric
and may result in a double destroy.
But it doesn't feel worth it to go add a destroy call manually to each
test, and a try/except around a possible second destroy call seems like
a happy middle ground.
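A minimal sketch of that middle ground (not the exact diff), with the destroy call tolerating tests that already cleaned up after themselves:
```python
import unittest

import torch.distributed as dist

class ExampleTestCaseSketch(unittest.TestCase):
    def tearDown(self):
        # Destroy any process group the test left behind; tolerate the case
        # where the test already destroyed it (or never created one).
        try:
            if dist.is_initialized():
                dist.destroy_process_group()
        except (AssertionError, ValueError, RuntimeError):
            pass
        super().tearDown()
```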
Note: tests that want to ensure that destroy runs cleanly can and should
still call destroy _inside_ the test, and this change does not affect
that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140820
Approved by: https://github.com/fegin