### Before this PR:
`torch.utils.swap_tensors(a, b)` required the `use_count` of `a` and `b` to be 1
```python
a = torch.randn(2, 3, requires_grad=True)
b = torch.randn(2, 4)
out = a * 2
out.sum().backward()
# Calling swap_tensors here would fail due to the reference held by AccumulateGrad node, which is not cleaned up after backward
# torch.utils.swap_tensors(a, b)
del out
# Calling swap_tensors here would pass
torch.utils.swap_tensors(a, b)
```
### After this PR:
`torch.utils.swap_tensors(a, b)` requires the `use_count` of `a` and `b` to be 1, or 2 if the second reference is held by an `AccumulateGrad` node.
A pre-hook is registered on the `AccumulateGrad` node so that it fails if it is ever called (i.e. if the user attempts to backward through the graph).
```python
a = torch.randn(2, 3, requires_grad=True)
b = torch.randn(2, 4)
out = a * 2
out.sum().backward()
# Calling swap_tensors here is ok
torch.utils.swap_tensors(a, b)
# If we ever backward to the AccumulateGrad node it will error that it was poisoned by swap_tensors
```
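The relaxed rule can be modeled without autograd internals. Below is a toy, plain-Python analogue (all names are made up for illustration; this is not PyTorch's implementation): a second reference is tolerated only when held by the `AccumulateGrad` stand-in, which is poisoned by the swap so that a later backward through it errors.

```python
# Toy model (illustrative only, not PyTorch internals) of the relaxed
# use_count rule for swap_tensors.
class ToyImpl:
    def __init__(self, data):
        self.data = data
        self.use_count = 1    # owned by one tensor
        self.poisoned = False # set when a swap invalidates a grad node

class ToyAccumulateGrad:
    def __init__(self, impl):
        self.impl = impl
        impl.use_count += 1   # the grad node keeps a reference
    def backward(self):
        if self.impl.poisoned:
            raise RuntimeError("AccumulateGrad node was poisoned by swap_tensors")

def toy_swap(a, b, accum_grads=()):
    # A use_count of 2 is allowed only when the extra reference is held
    # by a known AccumulateGrad node; otherwise only 1 is allowed.
    for x in (a, b):
        allowed = 2 if any(g.impl is x for g in accum_grads) else 1
        if x.use_count > allowed:
            raise RuntimeError("use_count too high to swap")
    for g in accum_grads:
        g.impl.poisoned = True  # backward through this node now errors
    a.data, b.data = b.data, a.data

a, b = ToyImpl(1), ToyImpl(2)
node = ToyAccumulateGrad(a)          # a.use_count is now 2
toy_swap(a, b, accum_grads=[node])   # allowed after this PR
assert (a.data, b.data) == (2, 1)
try:
    node.backward()                  # errors: the node was poisoned
    raise AssertionError("expected RuntimeError")
except RuntimeError:
    pass
```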
### Application to `nn.Module`
This issue is especially pertinent in the context of `nn.Module`, where parameters have `AccumulateGrad` nodes initialized after the first forward. Specifically, this is intended to address https://github.com/pytorch/pytorch/pull/126814#issuecomment-2127777866. Previously, the following would fail at `m.cpu()`; we want users to be able to do this, and to raise an error only if the user ever attempts to backward through the poisoned `AccumulateGrad` node
```python
import torch
import torch.nn as nn
m = nn.Linear(3, 5)
inp = torch.randn(2, 3)
out = m(inp)
out.sum().backward()
m.cpu()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127313
Approved by: https://github.com/soulitzer
This updates the nesting of if statements in `nn.Module._apply` such that when
`torch.__future__.set_swap_module_params_on_conversion(True)` is set, we always try to swap, regardless of whether
- `torch._has_compatible_shallow_copy_type(param, fn(param))`
- `torch.__future__.set_overwrite_module_params_on_conversion` is set
This means that `meta_module.to_empty(device=...)` can now use the swap_tensors path cc @awgu
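The resulting control flow can be sketched as follows (a simplified illustration with made-up names, not the actual `_apply` source):

```python
def conversion_strategy(swap_enabled, compatible_shallow_copy, overwrite_enabled):
    # After this change: the swap future takes precedence unconditionally.
    if swap_enabled:
        return "swap_tensors"
    # Pre-existing fallback paths, heavily simplified here.
    if compatible_shallow_copy and not overwrite_enabled:
        return "in_place_data_assignment"
    return "reassign_new_parameter"

# Swap is attempted even when the other conditions would have short-circuited it.
assert conversion_strategy(True, True, True) == "swap_tensors"
assert conversion_strategy(True, False, False) == "swap_tensors"
```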
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120659
Approved by: https://github.com/albanD
Added `torch.__future__.{get/set}_swap_module_params_on_conversion`, which defaults to `False` for now; we will probably want to override this and default to `True` in `nn.Module._apply` when the input is a tensor subclass.
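The getter/setter pair follows the usual module-level flag pattern; a minimal sketch (illustrative only, not the `torch.__future__` source):

```python
# Module-level future flag, off by default.
_swap_module_params_on_conversion = False

def set_swap_module_params_on_conversion(value: bool) -> None:
    global _swap_module_params_on_conversion
    _swap_module_params_on_conversion = value

def get_swap_module_params_on_conversion() -> bool:
    return _swap_module_params_on_conversion

set_swap_module_params_on_conversion(True)
assert get_swap_module_params_on_conversion()
```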
From offline discussion, for now we are **not** allowing `swap_tensors` after the first module forward has been run*** if the autograd graph is still alive. The reason is that `torch.utils.swap_tensors(t1, t2)` requires the `use_count` of both `TensorImpl`s associated with `t1` and `t2` to be 1. The first forward pass will install `AccumulateGrad` nodes on each param, which [bump the refcount of the associated TensorImpl](6cf1fc66e3/torch/csrc/autograd/variable.cpp (L307)). **Future work might be to swap the refs that the `AccumulateGrad` nodes hold if it is necessary.**
***From this, it might seem like we don't need to handle gradients. However, I still handle the grads for the edge case that the grads are set via `p.grad = grad` OR the autograd graph is no longer alive because the output has been garbage collected.
If any `swap_tensors` fails on any of the parameters in the `nn.Module` we raise an error.
**`RNNBase` overrides `nn.Module._apply()` and installs weakrefs on some parameters. As a result, all modules that inherit from `RNNBase` (`RNN`, `GRU`, and `LSTM`) cannot use the `swap_tensors` path as of now.**
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117167
Approved by: https://github.com/albanD
ghstack dependencies: #118028
Adds the unit tests requested in #95810
This PR also addresses a gap in unit testing of gradients, as `gradcheck` always performs total derivatives w.r.t. all arguments and module parameters. Some modules have different code paths for partial derivatives, e.g. `LayerNorm`, and those should be tested separately.
The PR has the following limitations:
- it does not test partial derivatives w.r.t. every combination of arguments, which would exponentially increase CI time.
- it does not implement the same logic for Hessians, where the increase in CI time would be quadratic in the number of arguments.
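The idea of testing a partial derivative (perturbing one argument while holding the rest fixed) can be illustrated framework-free with a central finite difference over a toy function:

```python
def partial_diff(f, args, i, eps=1e-6):
    # Central difference w.r.t. argument i only; all other arguments fixed.
    hi = list(args); hi[i] += eps
    lo = list(args); lo[i] -= eps
    return (f(*hi) - f(*lo)) / (2 * eps)

def affine(x, w, b):  # toy stand-in for a module forward
    return w * x + b

# d(affine)/dw at (x=2, w=3, b=1) is x = 2, independent of w and b
assert abs(partial_diff(affine, [2.0, 3.0, 1.0], 1) - 2.0) < 1e-4
```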
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103809
Approved by: https://github.com/kit1980
Applies the remaining flake8-comprehensions fixes and checks. This change replaces all remaining unnecessary generator expressions with list/dict/set comprehensions, which are more succinct, more performant, and better supported by our torch.jit compiler. It also removes useless generator expressions such as `set(a for a in b)`, resolving them into a direct `set` call.
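An example of the kind of rewrite applied (illustrative, not taken from the diff; C401 is the flake8-comprehensions code for an unnecessary generator passed to `set()`):

```python
data = ["a", "b", "a"]
# Before: unnecessary generator expression inside set() (flagged as C401)
before = set(x for x in data)
# After: a set comprehension, or simply set(data) when no transform is applied
after = {x for x in data}
assert before == after == set(data)
```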
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94676
Approved by: https://github.com/ezyang
Fixes #52664. `__setattr__` now checks whether the attribute is a property that defines a setter and calls its `fset` rather than registering an inaccessible module / parameter.
This is BC-breaking, as attribute setters on `nn.Module` properties used to be ignored and will now be called properly.
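A simplified sketch of the new `__setattr__` behavior (plain Python with made-up class names, not the actual `nn.Module` source):

```python
# Sketch: consult the class for a property with a setter and call fset,
# instead of unconditionally setting the value on the instance.
class ModuleLike:
    def __setattr__(self, name, value):
        prop = getattr(type(self), name, None)
        if isinstance(prop, property) and prop.fset is not None:
            prop.fset(self, value)  # the property setter actually runs now
        else:
            object.__setattr__(self, name, value)

class M(ModuleLike):
    @property
    def scale(self):
        return self._scale
    @scale.setter
    def scale(self, v):
        object.__setattr__(self, "_scale", 2 * v)

m = M()
m.scale = 3          # routed through the property setter
assert m.scale == 6  # setter doubled the value, proving it was called
```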
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92044
Approved by: https://github.com/albanD
### Changelist
* Change Windows TORCH_CUDA_ARCH_LIST from `7.0` to `8.6` to be compatible with the NVIDIA A10G GPU
* Correctly disable some tests that require flash attention, which is not available on Windows at the moment. This has since been fixed by https://github.com/pytorch/pytorch/pull/91979
* G5 runner has `AMD EPYC 7R32` CPU, not an Intel one
* This seems to change the behavior of `GetDefaultMobileCPUAllocator` in `cpu_profiling_allocator_test`. This might need to be investigated further (TODO: TRACKING ISSUE). In the meantime, the test has been updated accordingly to use `GetDefaultCPUAllocator` correctly instead of `GetDefaultMobileCPUAllocator` for mobile build
* Also, one periodic test, `test_cpu_gpu_parity_nn_Conv3d_cuda_float32`, fails with a tensor-not-close error when comparing grad tensors between CPU and GPU. This is fixed by turning off TF32 for the test.
### Performance gain
* (CURRENT) p3.2xlarge - https://hud.pytorch.org/tts shows each Windows CUDA shard (1-5 + functorch) takes about 2 hours to finish (duration)
* (NEW RUNNER) g5.4xlarge - The very rough estimate of the duration is 1h30m for each shard, meaning a half-hour gain (**25%**)
### Pricing
On demand hourly rate:
* (CURRENT) p3.2xlarge: $3.428. Total = Total hours spent on Windows CUDA tests * 3.428
* (NEW RUNNER) g5.4xlarge: $2.36. Total = Total hours spent on Windows CUDA tests * Duration gain (0.75) * 2.36
So the current runner is not only more expensive but also slower. Switching to G5 runners for Windows should cut the cost by (3.428 - 0.75 * 2.36) / 3.428 = **~48%**
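Evaluating that formula with the rates above (the 0.75 factor is the estimated duration ratio from the previous section):

```python
p3_rate = 3.428        # $/hour, p3.2xlarge on-demand
g5_rate = 2.36         # $/hour, g5.4xlarge on-demand
duration_ratio = 0.75  # g5 finishes the same shards in ~75% of the time

savings = (p3_rate - duration_ratio * g5_rate) / p3_rate
assert 0.48 < savings < 0.49  # roughly 48%
```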
### Rolling out
https://github.com/pytorch/test-infra/pull/1376 needs to be reviewed and approved to ensure the capacity of the runner before PR can be merged.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91727
Approved by: https://github.com/ZainRizvi, https://github.com/malfet, https://github.com/seemethere
The intention of this assert is to check the input tensor's version
counter has increased, indicating it was mutated by `m_inplace`.
However, the cloning step to create `input_arg_clone` restarts the
version counter to zero, so this test may fail if the sample input
was ever mutated during its creation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85734
Approved by: https://github.com/albanD
Fixes #80247
This PR:
* Refactors the skip logic as done for OpInfo in #62713, fixing the logic error
* For tests that were wrongly skipped before and now fail:
* Fix `TestModule.test_cpu_gpu_parity` to support Lazy modules - this was affecting `LazyConv*`
* Adds `@expectedFailure` decorators and a follow-up message to address `Conv*` failures on `TestModule.test_memory_format`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80471
Approved by: https://github.com/mruberry