Rewrite inplace addcdiv to a div, mul and inplace add to avoid graph break
Rewrite inplace add to a mul and inplace add to avoid graph break
Needed to close optimizer graph breaks
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90330
Approved by: https://github.com/jansel
Summary: To get source for a particular module, the "correct" thing to do is to check the module's spec and use `get_source` if it's a SourceFileLoader, since subclasses may look elsewhere than the `__file__`, and the spec will give the source of truth. For torch packager, however, we prefer to use linecache, but the loader could still change the file, so we figure out the file for the module using the spec's loader rather than using `module.__file__`, if possible.
Test Plan: This code path will get exercised by CI. Also added a test for remapped files.
Differential Revision: D41412983
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90258
Approved by: https://github.com/PaliC
Summary:
We should not fork in deploy when initializing torch.
Traceback (most recent call last):
File "<string>", line 38, in <module>
File "<string>", line 36, in __run
File "/usr/local/fbcode/platform010/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/local/fbcode/platform010/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/data/users/zyan/fbsource/buck-out/v2/gen/fbcode/104a4d5c3a690252/multipy/runtime/__test_py__/test_py#link-tree/multipy/runtime/test_py.py", line 61, in <module>
import torch # has to be done serially otherwise things will segfault
File "/data/users/zyan/fbsource/buck-out/v2/gen/fbcode/104a4d5c3a690252/multipy/runtime/__test_py__/test_py#link-tree/torch/__init__.py", line 158, in <module>
platform.system() != 'Windows':
File "/usr/local/fbcode/platform010/lib/python3.8/platform.py", line 891, in system
return uname().system
File "/usr/local/fbcode/platform010/lib/python3.8/platform.py", line 857, in uname
processor = _syscmd_uname('-p', '')
File "/usr/local/fbcode/platform010/lib/python3.8/platform.py", line 613, in _syscmd_uname
output = subprocess.check_output(('uname', option),
Test Plan: override a local script run trigger init and set `subprocess.check_output` to None
Reviewed By: yinghai, houseroad
Differential Revision: D41848592
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90492
Approved by: https://github.com/PaliC
Happy to split this PR more if it helps.
This PR adds functorch.grad support for autograd.Function. There's a lot
going on; here is the high level picture and there are more details as
comments in the code.
Mechanism (PyOperator)
- Somehow, autograd.Function needs to dispatch with functorch. This is
necessary because every layer of functorch needs to see the
autograd.Function; grad layers need to preserve the backward pass.
- The mechanism for this is via PyOperator. If functorch transforms are
active, then we wrap the autograd.Function in a `custom_function_call`
PyOperator where we are able to define various rules for functorch
transforms.
- `custom_function_call` has a rule for the functorch grad transform.
autograd.Function changes
- I needed to make some changes to autograd.Function to make this work.
- First, this PR splits autograd.Function into a _SingleLevelFunction
(that works with a single level of functorch transform) and
autograd.Function (which works with multiple levels). This is necessary
because functorch's grad rule needs some way of specifying a backward
pass for that level only.
- This PR changes autograd.Function's apply to eitehr call
`custom_function_call` (if functorch is active) or super().apply (if
functorch isn't active).
Testing
- Most of this PR is just testing. It creates an autograd.Function
OpInfo database that then gets passed to the functorch grad-based tests
(grad, vjp, vjpvjp).
- Since functorch transform tests are autogenerated from OpInfo tests,
this is the easiest way to test various autograd.Function with
functorch.
Future
- jvp and vmap support coming next
- better error message (functorch only supports autograd.Function that
have the optional setup_context staticmethod)
- documentation to come when we remove the feature flag
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89860
Approved by: https://github.com/soulitzer
Adds a setup_context staticmethod to autograd.Function.
If it exists, then the user splits the ctx-specific logic from the
forward() and puts it in the setup_context staticmethod.
Docs will come later when we remove the feature flag.
Test Plan:
- some light tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89859
Approved by: https://github.com/soulitzer
This PR adds a private runtime feature flag for the feature work we're going
to do with extending autograd.Function. The motivation of the feature flag
is:
- to guard the feature against unsuspecting users
- control the release of the feature to when we are ready to release it
We might not even need the feature flag (because we hope to have the
work done in the next month), but it is good practice and it does touch
currently public API (autograd.Function).
Concretely, "autograd.Function extension" refers to:
- adding an optional `setup_context` staticmethod to autograd.Function
- adding an optional `vmap` staticmethod to autograd.Function
- autograd.Function support for functorch
Test Plan:
- new test that the feature flag works
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89858
Approved by: https://github.com/soulitzer
Summary:
This pull request makes some tweaks on LazyTensor class such that it's easier for XLATensor to inherit.
1. It replaces data_ptr() with data() which now returns a const shared_ptr& type.
2. It adds a temporary ctor to LazyTensor::Data such that XLATensor::Data can easily inherits it.
3. It moves LazyTensor(std::shared_ptr<Data>) and SetTensorData(at::Tensor) to protected for XLATensor to access.
Test Plan:
CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90363
Approved by: https://github.com/JackCaoG
The function signature in its current state is ambiguous.
Its an inline function that is also declared to be imported from the DLL.
which leaves it subject to compilers decision to choose one or the other and depending on what the compiler/linker may choose we may get one of the two behaviors for the `aten::init_num_threads` call:
1. Once-per-dll-in-a-thread (if its inlined)
2. Once-per-thread (if its imported)
I suspect once-per-dll-in-a-thread is already the case currently because it being tagged inline
So removing the inline will simply make it a little more consistent and clear.
The function exists to avoid repeated calls to aten::init_num_threads.
Being in an "internal" namespace, the function isnt expected to be called by external plugins which means that the "once-per-dll-in-a-thread" behavior isn't that much of a problem anyway
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89511
Approved by: https://github.com/malfet
Adds 2 new hybrid sharding strategy to FSDP:
1. HYBRID_SHARD: applies zero-3 style sharding within a node, and data parallel across
2. HYBRID_SHARD_ZERO2: applies zero-2 style sharding within a node, and data parallel across
These are useful for medium sized models and aim to decrease communication volume, tests and benchmarks will be run to understand which workloads are optimal under which sharding strategy.
Hybrid sharding in general works by sharding the model using a process group within a single node, and creating intra-node process groups for replication / data parallelism. The user either needs to pass in a tuple of these process groups, or None, and we generate the process groups appropriately.
** Acknowledgements **
- @awgu 's excellent prototype: 5ad3a16d48
- @liangluofb For ideation, feedback, and initial implementation and experimentation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89915
Approved by: https://github.com/awgu
This makes the signature of `torch.masked.std` and `var` more consistent with the global namespace variant and also updates the sample inputs to repurpose the existing `sample_inputs_std_var` inputs which fully exercise the `correction` argument.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87118
Approved by: https://github.com/cpuhrsch
I believe that @mrshenli used `ModuleWrapPolicy({UnitModule})` when applying `fully_shard` to `UnitModule`s because `policy=None` was not supported. However, he added that support in a previous PR, so this PR simplifies to using `policy=None` to make the intention more clear.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90400
Approved by: https://github.com/mrshenli
- This PR introduces a new concept, the _communication module_ (denoted `comm_module`), that represents the module responsible for the unshard/reshard pair for a `FlatParamHandle`. This is well-defined because the current design assumes that each `FlatParamHandle` only has _one_ unshard/reshard pair for either the forward or backward pass.
- For the wrapper code path, the `comm_module` is exactly the module already being passed to the `FlatParamHandle` constructor.
- For the composable code path, the `comm_module` is not necessarily the module already being passed to the `FlatParamHandle`. This is because the module already being passed is always the local FSDP root module to give complete FQNs, instead of local FQNs. Distinguishing the communication module from the local FSDP root module can provide more flexibility for non-recursive wrapping designs in the future.
- This PR adds a unit test `test_unshard_reshard_order` that explicitly checks that `_unshard` and `_reshard` are called in the exactly the same order across the two code paths.
- This PR does not fix `test_checkpoint_fsdp_submodules_use_reentrant`. However, the error message changes, so this PR accommodates that.
- The error is now the same as if we used the equivalent wrapper FSDP:
```
test_model.u1 = FSDP(test_model.u1, use_orig_params=True)
test_model.u2 = FSDP(test_model.u2, use_orig_params=True)
```
- The error is also the same as if we used wrapper FSDP with `use_orig_params=False`, so it is not unique to `use_orig_params=True`.
---
**`comm_module` Example**
```
model = Model(
seq1: nn.Sequential(
nn.Linear
nn.ReLU
nn.Linear
nn.ReLU
)
seq2: nn.Sequential(
nn.Linear
nn.ReLU
nn.Linear
nn.ReLU
)
)
policy = ModuleWrapPolicy({nn.Sequential})
fully_shard(model, policy=policy)
FullyShardedDataParallel(model, auto_wrap_policy=policy)
```
- This policy constructs two `FlatParamHandle`s, one for `seq1` and one for `seq2`.
- `FullyShardedDataParallel` will pass `seq1` and `seq2` as the `module` argument to the two `FlatParamHandle`s, respectively.
- `fully_shard()` will pass `model` as the `module` argument to every `FlatParamHandle`.
- `FullyShardedDataParallel` will pass `seq1` and `seq2` as the `comm_module` argument to the two `FlatParamHandle`s, respectively.
- `fully_shard()` will pass `seq1` and `seq2` as the `comm_module` argument to the two `FlatParamHandle`s, respectively.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90387
Approved by: https://github.com/mrshenli
Unlike for FSDP, where we already diverged to using per-test-file models, let us try to use the same set of models for the composable API effort. This can improve debugging efficiency because we know which module structures we support and which we do not _across all of our composable APIs_.
This PR had to perform some surgery for `test_materialize_meta_module`. Writing a correct parameter initialization function for meta device initialization is not easy, and we should revisit this. The old implementation, which followed the style of the previous unit tests--namely, using `module.to_empty()`--is actually incorrect for nested FSDP applications because `module.to_empty()` will re-initialize already materialized parameters and the module materialization proceeds bottom up. The existing unit test in `test_fsdp_meta.py` passes because it sets every parameter to ones (`self.weight.fill_(1)`), which is idempotent to re-initialization.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90386
Approved by: https://github.com/mrshenli
The `PositiveDefiniteTransform` is required to transform from an unconstrained space to positive definite matrices, e.g. to support testing the Wishart mode in #76690. It is a simple extension of the `LowerCholeskyTransform`.
I've also added a small test that ensures the generated data belong to the domain of the associated transform. Previously, the data generated for the inverse transform of the `LowerCholeskyTransform` wasn't part of the domain, and the test only passed because the comparison uses `equal_nan=True`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76777
Approved by: https://github.com/lezcano, https://github.com/fritzo, https://github.com/soumith
Summary:
This PR implements `prune` in BaseStructuredSparsifier:
`prune` is a function that takes in a model with structured sparsity parametritizations (the result of `prepare`) and will return a resized model with the masked out weights removed.
`prune` is defined by a mapping from **patterns** to different **pruning functions**.
- **patterns** are just sequences of operations, for example `(nn.Linear, activation, nn.Linear)`
- **pruning functions** are functions that take in an matched pattern as args and will resize the appropriate layer sizes and weights.
```
def prune_linear_activation_linear(linear1, activation, linear2):
pass
```
- This is one line in the pattern config `(nn.Linear, activation, nn.Linear): prune_linear_activation_linear`
At a high level `prune` works by finding instances of the graph that match different patterns and then calling the mapped pruning functions on those matched patterns.
This is unlike the previous code which attempted to do both at the same time.
There may be some gaps in the patterns compared to the previous implementation, but the conversion functionality support should be the same.
Currently we have pruning functions for the following patterns:
- linear -> linear
- linear -> activation -> linear
- conv2d -> conv2d
- conv2d -> activation -> conv2d
- conv2d -> activation -> pool -> conv2d
- conv2d -> pool -> activation -> conv2d
- conv2d -> adaptive pool -> flatten -> linear
Added in MyPy type hints as well for the prune_functions.
Test Plan:
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89777
Approved by: https://github.com/vkuzo
Prior to this change, the symbolic_fn `layer_norm` (before ONNX version 17) always lose precision when eps is smaller than Float type, while PyTorch always take eps as Double. This PR adds `onnx::Cast` into eps related operations to prevent losing precision during the calculation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89869
Approved by: https://github.com/BowenBao
This is the last PR for integrating 2D into core distributed.
This PR does the following:
1. Add optimizer.py: this adds ability to load a state_dict in conjunction with FSDP sharded optimzer state.
2. Update default_planner.py to support 2D checkpoint.
3. Add test_fsdp_optim_state.py as a unit test for No. 1.
4. Fix bug in torch/testing/_internal/distributed/checkpoint_utils.py
5. Rename the filename for the APIs that should be private. Will organize and cleanup further in following PRs. #90328
Docstring and integration test will be added in the following PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90212
Approved by: https://github.com/wanchaol