Summary:
Addresses https://github.com/pytorch/pytorch/issues/91888
`wait` is the number of steps to wait between profiling cycles, and `skip_first` delays the start of profiling. However, once the `skip_first` steps are completed, we immediately enter the wait phase. This is not a problem if `wait` is smaller than `skip_first`, because we can just lower the value of `skip_first`, but if it is larger, the first profile starts much later than desired. For example, imagine a `skip_first` of 1 and a `wait` of 100 with a `repeat` of 2. We do want to wait 100 steps between cycles 1 and 2, but we may not want to start the warmup of cycle 1 at step 101 (forced because the wait occurs directly after the first steps are skipped). This diff addresses this by adding a flag to skip the first wait.
The new flag defaults to false so that the existing behavior is not affected.
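The effect on the example above can be sketched in pure Python. This is only an illustration, assuming the flag works by treating the first wait as already elapsed; `phase_at`, `first_warmup_step`, the string phase labels, and the choice of `warmup=1, active=1` are made up for this sketch and are not the profiler's API.

```python
def phase_at(step, *, wait, warmup, active, repeat=0, skip_first=0, skip_first_wait=0):
    """Return an illustrative profiler phase name for the given step."""
    if step < skip_first:
        return "none"
    step -= skip_first
    if skip_first_wait:
        # Assumption: pretend the first wait has already elapsed.
        step += wait
    cycle = wait + warmup + active
    if repeat > 0 and step // cycle >= repeat:
        return "none"
    pos = step % cycle
    if pos < wait:
        return "none"
    if pos < wait + warmup:
        return "warmup"
    return "record"


def first_warmup_step(**kwargs):
    """Find the first step whose phase is warmup (search is bounded for safety)."""
    return next(s for s in range(10_000) if phase_at(s, **kwargs) == "warmup")


# skip_first=1, wait=100, repeat=2 as in the summary; warmup/active are assumed.
cfg = dict(wait=100, warmup=1, active=1, repeat=2, skip_first=1)
print(first_warmup_step(**cfg))                     # 101
print(first_warmup_step(**cfg, skip_first_wait=1))  # 1
```

With the flag set, the wait of 100 steps still separates cycle 1 from cycle 2 in this sketch; only the initial wait is dropped.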
Test Plan:
Got reasonable traces with this schedule:
schedule=torch.profiler.schedule(
wait=10, warmup=3, active=1, repeat=1, skip_first=1, skip_first_wait=1
)
Differential Revision: D66198138
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141070
Approved by: https://github.com/aaronenyeshi, https://github.com/briancoutinho
Summary: Without this change, calling `str(MatchState.SOMETHING)` raises an exception.
Test Plan:
Can we add unittest somewhere?
Ensure `str(MatchState.FULLY_MATCHED)` and `str(MatchState.FULLY_MATCHED())` do not raise an exception.
Differential Revision: D66321609
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141297
Approved by: https://github.com/fduwjj
#140739 and #140740 made it such that `get_safe_globals` no longer returns an empty list by default.
This caused some tests that check the contents of `get_safe_globals` to fail when run individually ([T208186010](https://www.internalfb.com/intern/tasks/?t=208186010)); they didn't fail in the test suite because other tests that ran before them called `clear_safe_globals`. The affected tests are:
test_safe_globals_for_weights_only
test_safe_globals_context_manager_weights_only
This PR fixes that and also makes most tests that call `clear_safe_globals` use the `safe_globals` context manager rather than `try`/`finally`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141300
Approved by: https://github.com/awgu
Summary:
`repro.py` can have nested graph modules, e.g.
```
class Repro(torch.nn.Module):
def __init__(self) -> None:
super().__init__()
self.true_graph_0 = GraphModule()
def forward(self):
true_graph_0 = self.true_graph_0
return (true_graph_0,)
```
So dumping the string doesn’t always work.
So,
1) we use exported program in repro.py instead
2) we still dump the graph module string, but only put it in comments
We also added two flags to `minifier_launcher.py`
- `minifier-export-mode`: whether strict or non-strict export is used in the minifier
- `skip-export-error`: intermediate graphs that cannot be exported will be skipped.
Test Plan:
```
buck2 run fbcode//caffe2/test/inductor:minifier_utils_cpu -- -r string
python test/inductor/test_minifier.py
```
Differential Revision: D66175257
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141159
Approved by: https://github.com/henrylhtsang
As suggested by @leslie-fang-intel in 4c83e4e751 (diff-139642bd981df977f70f4c18c1c34bd1a85c1d6b9ffa06aaa98426ed83942a31R537) - all elements of `B` tiles (not referring to AMX tiles, but the tiles at the granularity of the micro-kernel) have contiguous elements since `B` matrix is pre-packed, so dequantized buffer loading logic can be simplified. While the previous approach kept elements to be loaded into a B AMX tile contiguous, the new approach doesn't entail any performance penalty either because that data is already in L1D, so loading AMX tiles from non-contiguous dequantized B elements doesn't adversely affect performance.
Also rectified the size of the dequantized B buffer.
Fixes #140208.
A subsequent PR will factor out caching of dequantized int8 weights into a separate codegen function
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140258
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel
Summary:
When we have both `set_grad` and `autocast` HOP, name collision might happen when we try to inline a node.
For example, for a GraphModule like this:
```
GraphModule(
(submod_0): GraphModule(
(submod_1): GraphModule()
)
(submod_1): GraphModule()
(submod_2): GraphModule()
)
```
when we inline `submod_0`, we might accidentally overwrite `submod_1`.
In this PR, we fix this by checking whether the graph module already has an attribute with the same name; if so, we use the next "submod_{i}" until there is no name collision.
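The renaming rule can be sketched as follows. `next_free_submod_name` is a hypothetical helper for illustration and `SimpleNamespace` stands in for a GraphModule; the actual inlining code is not shown here.

```python
from types import SimpleNamespace


def next_free_submod_name(module, start=0):
    """Return the first 'submod_{i}' (i >= start) that is not yet an attribute."""
    i = start
    while hasattr(module, f"submod_{i}"):
        i += 1
    return f"submod_{i}"


# Stand-in for the outer GraphModule from the example above.
outer = SimpleNamespace(submod_0=object(), submod_1=object(), submod_2=object())

# Inlining submod_0 would bring in its own child named submod_1;
# pick a fresh name instead of overwriting the existing attribute.
new_name = next_free_submod_name(outer, start=1)
setattr(outer, new_name, object())
print(new_name)  # submod_3
```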
Partially fixes https://github.com/pytorch/pytorch/issues/140589.
Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r test_predispatch_autocast_and_set_grad
```
Differential Revision: D66200994
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141169
Approved by: https://github.com/angelayi
fsspec transactions do not support concurrency and assume that there is at most 1 running transaction per filesystem. This is *not* true in our usage, where because of multi-threading we usually have multiple concurrent transactions running at once.
Previously, this would just (unsafely) pass but lead to hard-to-debug race conditions (since the commit of one transaction will blow away the state of the other transaction). In fsspec 2024.3.0, trying to commit concurrent transactions will actually crash (see the code at 76ca4a6888/fsspec/transaction.py (L39) -- because each filesystem can have a single transaction, this tear-down logic will error).
Instead, let's manually handle committing / discarding changes to the file. This does this "the old-fashioned way" instead of using `fsspec`'s commit/rollback behavior because the internal PathManagerFileSystem used for `iopath` does not properly support that behavior.
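A minimal sketch of manual commit / discard on a local filesystem, using only the standard library. The temp-file-plus-rename approach and the name `commit_or_discard` are assumptions for illustration; the code in this PR goes through `fsspec` filesystems and may differ.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def commit_or_discard(path):
    """Write to a temporary file; publish it on success, delete it on failure."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            yield f
        os.replace(tmp_path, path)  # "commit": move the finished file into place
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # "discard": drop the partial file
        raise


workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "checkpoint.bin")
with commit_or_discard(target) as f:
    f.write(b"payload")
```

Each writer owns its own temporary file, so concurrent writers in different threads do not share any per-filesystem transaction state.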
I don't have a minimal test-case, but in Meta this solves a broken test on `fsspec >= 2024.3.0`:
Before: https://www.internalfb.com/intern/testinfra/testrun/7318349626774607
After: https://www.internalfb.com/intern/testinfra/testrun/2251800062722633
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135541
Approved by: https://github.com/Skylion007
Fixes #140986
This includes several improvements on the grammar and wording of nn/module.py, mostly simple one word fixes, but also other slightly more elaborate ones.
It addresses about half of the docs for module.py but I would be glad to cover the rest of it if required to do so.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140987
Approved by: https://github.com/mikaylagawarecki
Summary:
Splitting this PR into two, one for the cuSPARSELt improvements, and one
for the inductor lowering.
This PR adds in the additional cuSPARSELt bindings into pytorch.
* `torch._cslt_sparse_mm_search` will be deprecated in a future PR,
so a warning has been added
* Added a header file for cuSPARSELtOps.cpp
* max_id is now available in `torch.backends.cusparselt` via
`torch.backends.cusparselt.get_max_alg_id()`
* fixed meta registrations for float8
Test Plan:
python test/test_sparse_semi_structured.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137427
Approved by: https://github.com/cpuhrsch, https://github.com/eqy
Ever since #135140, this test will fail if run with CPU parameterization (e.g. test_out__refs_logical_or_cpu_float32) and CUDA available - as far as I can tell, the PyTorch CI isn't currently checking for this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137140
Approved by: https://github.com/ezyang
# Summary
We have another IMA for captured buffers when the sequences are not divisible.
Running test before this commit:
```Shell
========= Error: process didn't terminate successfully
========= Target application returned an error
========= ERROR SUMMARY: 447 errors
========= ERROR SUMMARY: 347 errors were not printed. Use --print-limit option to adjust the number of printed errors
```
And After
```Shell
❯ CUDA_LAUNCH_BLOCKING=1 PYTORCH_NO_CUDA_MEMORY_CACHING=1 compute-sanitizer --tool memcheck pytest test/inductor/test_flex_attention.py -k "test_non_divisible_with_captured_buffer"
========= COMPUTE-SANITIZER
====================================================== test session starts =======================================================
platform linux -- Python 3.12.7, pytest-7.4.0, pluggy-1.5.0
rootdir: /home/drisspg/meta/pytorch
configfile: pytest.ini
plugins: hypothesis-6.115.5, typeguard-4.3.0
collected 518 items / 517 deselected / 1 selected
Running 1 items in this shard
test/inductor/test_flex_attention.py . [100%]
=============================================== 1 passed, 517 deselected in 13.31s ===============================================
========= ERROR SUMMARY: 0 errors
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141164
Approved by: https://github.com/Chillee
Summary:
Add the following inductor fx graph cache stats to dynamo compile
- inductor_fx_cache_hit_count
- inductor_fx_cache_miss_count
- inductor_fx_cache_backend_type
- inductor_fx_cache_hit_keys
- inductor_fx_cache_miss_keys
- remote_cache_version
Test Plan: Run local tests and staging logger: P1683061460
Differential Revision: D66232206
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141190
Approved by: https://github.com/masnesral
Changes the semantics of `__repr__` of P2POp: s, d are now group ranks instead
of global ranks. I think this is OK since I also updated the field names
to make this obvious.
Also add mypy annotations
Partially addresses RFC 0042 (pytorch/rfcs#71)
See more details/motivation in #140460
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141054
Approved by: https://github.com/kwen2501
Fix: #139936
This PR modifies the lowering of `split` operation, so that it won't generate guards,
specializing on the sizes parameter. Instead, it specializes on the number of output
tensors being generated (i.e. function of the size of the base tensor, and the sizes
parameter).
As a result, operations such as `chunk` (whose number of output tensors usually is
constant given a static chunk number) won't trigger recompiles when varying the size of
the base tensor.
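To see why specializing on the number of outputs avoids recompiles for `chunk`, here is a small pure-Python sketch of the output counts. It assumes positive sizes and the usual ceil-division behavior of `split` with an integer split size and of `chunk`; the helper names are made up for this sketch.

```python
import math


def num_split_outputs(dim_size, split_size):
    """Number of tensors produced by splitting dim_size into pieces of split_size."""
    return math.ceil(dim_size / split_size)


def num_chunk_outputs(dim_size, chunks):
    """Number of tensors produced by chunk: split with size ceil(dim_size / chunks)."""
    split_size = math.ceil(dim_size / chunks)
    return math.ceil(dim_size / split_size)


# The split size varies with the base tensor size...
print([math.ceil(n / 3) for n in (10, 100, 1000)])          # [4, 34, 334]
# ...but the number of outputs of chunk(3) stays the same.
print([num_chunk_outputs(n, 3) for n in (10, 100, 1000)])   # [3, 3, 3]
```

The count is only "usually" constant: in this sketch `num_chunk_outputs(5, 4)` is 3, since a size of 5 cannot be divided into 4 chunks of size 2.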
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141077
Approved by: https://github.com/ezyang
Currently real tensor tracing raises MetadataMismatchErrors if registered fake kernels don't match the real kernels (e.g. shape, aliasing, dtype, etc.). This adds an option to use fake kernel inference to bypass mismatches - this option defaults to False for real tensor tracing, but is on for draft export.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139766
Approved by: https://github.com/angelayi, https://github.com/zou3519
Adding `destroy_pg_upon_exit` property to allow derived Test classes to control whether auto destroy is desired.
(Otherwise, derived test classes will need to rewrite the `_run()` method, leading to duplicated code of `_run()` and if one needs to add things to `_run` in the future, more code change is needed.)
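The pattern can be sketched as below. The class names and the body of `_run()` are simplified placeholders; only the `destroy_pg_upon_exit` property name comes from this PR.

```python
class BaseProcessTest:
    @property
    def destroy_pg_upon_exit(self) -> bool:
        # Default: tear down the process group automatically.
        return True

    def _run(self):
        events = ["run_test"]
        if self.destroy_pg_upon_exit:
            events.append("destroy_process_group")  # placeholder for the real teardown
        return events


class KeepGroupTest(BaseProcessTest):
    @property
    def destroy_pg_upon_exit(self) -> bool:
        # Opt out without copying _run().
        return False


print(BaseProcessTest()._run())  # ['run_test', 'destroy_process_group']
print(KeepGroupTest()._run())    # ['run_test']
```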
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141192
Approved by: https://github.com/wconstab
This PR implements the framework for supporting HOP in the ONNX exporter. Refer to https://github.com/pytorch/pytorch/issues/140995 for the design.
- Implement support for torch.cond
- Refactor `_add_nodes` into `_translate_fx_graph` to handle nested subgraphs. To support building subgraphs as functions using the same logic, new handlers for `placeholder` and `output` nodes are added to register inputs and outputs on the onnx function.
- Functions are created under the domain of `pkg.torch.__subgraph__`
- Updated the type promotion pass to run on nested subgraphs.
- Implement torch.cond in `_torchlib/ops/hop.py`. Updated the registry to discover these ops.
- Improve opset_import handling robustness with `add_opset_imports` IR pass. To achieve this, we added opset version to all Nodes.
Fixes https://github.com/pytorch/pytorch/issues/139503
Fixes #117655
Fixes #123972
Fixes #93743
Closes https://github.com/pytorch/pytorch/issues/140995
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137428
Approved by: https://github.com/justinchuby
Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
This test is failing locally in 3.12 and 3.13 and is blocking 3.13 CI enablement.
It may have to do with the scipy version, see .ci/docker/requirements-ci.txt (3.12+ has scipy 1.12.0/1.14.1, whereas < 3.12 requires scipy 1.10.1).
Wanted to xfail these tests, but they somehow pass sometimes on CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140731
Approved by: https://github.com/ezyang, https://github.com/malfet
Differential Revision: [D65362160](https://our.internmc.facebook.com/intern/diff/D65362160)
State after this IR:
1. The tests that require inference IR are replaced with ep.run_decomp({}), so export_for_training_run_decomp is sort of redundant, but I guess it is still nice to check that multiple rounds of retracing still work. In general, we need some auditing to reduce our redundant test coverage.
2. After this PR has landed and has not been reverted for a week or so, I will replace the export_for_training calls with export, as they are the same thing now.
3. Added more tests to also cover now "deprecated" old IR by patching export to use old export. For reviewers, please look at the internal version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139511
Approved by: https://github.com/ydwu4, https://github.com/angelayi, https://github.com/avikchaudhuri
Fixes a bunch of benchmarks that failed with cudagraph errors including `tlp python benchmarks/dynamo/timm_models.py --device cuda --inductor --accuracy --amp --training --only resmlp_12_224` when `specialize_float=False`
Also brings down number of overall failures (with keep-going) from 108 => 62. I'd estimate >80% of those 62 are wobbly expect tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140346
Approved by: https://github.com/ezyang
ghstack dependencies: #140983, #141003
Looks like a regression caused by use of strided API, but adding the test revealed (at least in CI), that on Ventura it worked but returned garbage results, so fixed by removing all the logic about channels last (as it's irrelevant for strided API case and placeholder already turns tensor into a correct one)
This also allows one to remove `mem_format_key` and `ns_shape_key` (it was redundant even back then, as `mem_format_key` + `getTensorsStringKey(grad_output_t)` already uniquely identified the operation)
Fixes https://github.com/pytorch/pytorch/issues/140902
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141009
Approved by: https://github.com/manuelcandales
hipGraphExecDestroy doesn't immediately free memory since rocm6.2.
They wait for next sync point in order to free the memory, this is to ensure that all hipGraphLaunch are finished before we release any memory.
We need to ensure all async operations finish before deleting the object.
capture_dev_ variable is used to save the device number when capture_begin() method is called
But CUDAGraph can be created and destroyed without calling capture_begin() method. `capture_dev_ = UNDEFINED_DEVICE;` allows to detect such a case and skip sync
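The sentinel logic can be sketched in Python as follows. The real change is in the C++ `CUDAGraph` class; `GraphSketch`, the value `-1`, and the list used to record synchronizations are illustrative stand-ins.

```python
UNDEFINED_DEVICE = -1  # assumed sentinel value for "capture_begin() was never called"


class GraphSketch:
    def __init__(self):
        self.capture_dev = UNDEFINED_DEVICE
        self.synced_devices = []  # records where a sync would have happened

    def capture_begin(self, device):
        # Remember which device the capture ran on.
        self.capture_dev = device

    def reset(self):
        # Only synchronize if a capture actually happened on some device.
        if self.capture_dev != UNDEFINED_DEVICE:
            self.synced_devices.append(self.capture_dev)
        # ...the graph exec object would be destroyed here...


never_captured = GraphSketch()
never_captured.reset()

captured = GraphSketch()
captured.capture_begin(device=0)
captured.reset()
```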
Tests impacted:
test_cuda.py::TestCuda::test_graph_make_graphed_callables_*
distributed/test_c10d_nccl.py::ProcessGroupNCCLTest::test_allreduce_in_cudagraph
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138722
Approved by: https://github.com/malfet, https://github.com/eqy, https://github.com/jeffdaily
* Automatically applies ruff rule 401. Turns loops into equivalent list comprehensions which are faster and do not leak the scope of the loop variables.
* list comprehensions not only often have better typing, but are 50+% faster than for loops on overhead. They also preserve length information etc and are better for the interpreter to optimize.
* Manually went back and made mypy happy after the change.
* Also fixed style lints in files covered by flake8 but not by pyfmt
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140980
Approved by: https://github.com/justinchuby, https://github.com/malfet
Summary:
**wins**
On the torchrec benchmark, for 2K nodes it saves 40 seconds.
With the recent sympy changes (https://www.internalfb.com/diff/D65883538) we save around 13 seconds (with the max opt on).
```
buck2 run fbcode//mode/opt fbcode//torchrec/distributed/tests:pt2_compile_benchmark -- --num-features=200
```
This diff optimizes the construction of expressions of the form
a+b+c... (all unique symbols).
which are very common in torchrec models.
**How**
Expressions of the form a+b+c are not optimized by add, the only needed optimization is sorting them.
If we have a+b+c and we are adding (d) to it, we can do a binary search to know
the position of (d) and avoid optimizing the new expression by passing the new order.
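The idea can be sketched with plain strings standing in for unique symbols. The actual diff works on sympy expressions; `add_symbol` and the tuple representation are assumptions made for this sketch.

```python
import bisect


def add_symbol(sorted_terms, new_term):
    """Insert new_term into an already sorted tuple of unique terms."""
    pos = bisect.bisect_left(sorted_terms, new_term)  # binary search for the position
    if pos < len(sorted_terms) and sorted_terms[pos] == new_term:
        raise ValueError("this sketch only handles unique symbols")
    # The result is sorted by construction, so no re-sort of the whole sum is needed.
    return sorted_terms[:pos] + (new_term,) + sorted_terms[pos:]


terms = ("a", "b", "d")
terms = add_symbol(terms, "c")
print(terms)  # ('a', 'b', 'c', 'd')
```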
**Extensions**:
1. support constant terms.
2. support 10a+10b+.. (this will give even more wins; support will be extended in a second PR)
Differential Revision: D66008482
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140822
Approved by: https://github.com/ezyang
Tracking issue: #138399
This PR changes the `pow` C++ implementation, making its C++ meta kernel consistent with
its Python ref implementation. The following example shows the inconsistency between the
two:
```python
def run(device):
S = (5,)
a = torch.rand(S, device=device, dtype=torch.float32)
b = 2
out = torch.empty(S, device=device, dtype=torch.float64)
return torch.pow(a, b, out=out)
>>> run("cpu")
Traceback (most recent call last):
File "test.py", line 34, in run
return torch.pow(a, b, out=out)
RuntimeError: Found dtype Double but expected Float
>>> run("meta")
tensor(..., device='meta', size=(5,), dtype=torch.float64)
```
**~Update:~**
~Note that this happens only for `pow.Tensor_Scalar` overloads. Therefore, this PR needed
further 2 modifications:~
- ~Split the `pow` ref implementation, making `pow.Tensor_Scalar` error on mismatching
output dtypes~
- ~Create a dispatch for `pow` when `_refs.pow()` is called~
**Update:**
Changing the `TensorIteratorConfig` for `pow.Tensor_Scalar` was easier and,
after the discussion below, more correct. The solution was to change the
`TensorIteratorBase::build_output_borrowing_argument_owning_unary_op` function,
setting:
- `cast_common_dtype_to_outputs`; and
- `enforce_safe_casting_to_output`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140287
Approved by: https://github.com/ezyang
Detailed description:
The code below raises an error:
```Python
import torch
from torch.fx.experimental.proxy_tensor import make_fx
def func(a):
b = a + 1
c = b.view(-1)
c.add_(1)
return b
input = torch.randn(2)
out = make_fx(func)(input)
```
The error looks like this:
```Python
...
File "/root/Git.d/pytorch/pytorch/torch/_dynamo/codegen.py", line 34, in <module>
from .variables.torch_function import TensorWithTFOverrideVariable
File "/root/Git.d/pytorch/pytorch/torch/_dynamo/variables/torch_function.py", line 185, in <module>
populate_builtin_to_tensor_fn_map()
File "/root/Git.d/pytorch/pytorch/torch/_dynamo/variables/torch_function.py", line 146, in populate_builtin_to_tensor_fn_map
inp0 = torch.ones(1)
File "/root/Git.d/pytorch/pytorch/torch/fx/experimental/proxy_tensor.py", line 1240, in __torch_function__
return func(*args, **kwargs)
File "/root/Git.d/pytorch/pytorch/torch/utils/_stats.py", line 21, in wrapper
return fn(*args, **kwargs)
File "/root/Git.d/pytorch/pytorch/torch/fx/experimental/proxy_tensor.py", line 1342, in __torch_dispatch__
return proxy_call(self, func, self.pre_dispatch, args, kwargs)
File "/root/Git.d/pytorch/pytorch/torch/fx/experimental/proxy_tensor.py", line 907, in proxy_call
name=proxy_mode.tracer.graph._target_to_str(func.overloadpacket.__name__),
AttributeError: 'PythonKeyTracer' object has no attribute 'graph'
...
```
Solutions:
Import torch._dynamo before dispatch_trace is called to avoid the context set before dispatch_trace from affecting the torch._dynamo import.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141022
Approved by: https://github.com/ezyang
Fixes `python test/inductor/test_fused_attention.py SDPAPatternRewriterCpuTests.test_pattern_fails_with_unsupported_mask_cpu` when `specialize_float=False`. You might wonder how it's related: there is a "negative" test that expects us not to match. Previously it would fail on isinstance(param, Tensor), but now that we tensorify the float, it did match and caused a failure. This check ensures the mask has the same shape to ensure this negative test case actually fails.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141003
Approved by: https://github.com/ezyang
ghstack dependencies: #140983
Summary: The way we've been de/serializing sympy.Exprs is not roundtrippable in all cases (serialize by calling `str(expr)`, and deserialize by calling `sympy.sympify(expr_str)`). This has led to expressions being mathematically equivalent but structurally different, causing issues in ValueRanges. Example issue: https://github.com/pytorch/pytorch/issues/136797
This starts to deprecate the use of `expr_str` and stores expressions in AST format instead. For BC purposes, `expr_str` deserialization is still supported, but we will always serialize to `expr_ast`. We'll kill this once the serialization upgrader design is finalized and implemented.
Test Plan: test_export
Differential Revision: D65638757
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140084
Approved by: https://github.com/angelayi
Fix https://github.com/pytorch/pytorch/issues/140462 .
Horace found that when we implicitly fallback to eager, some eager kernels may not work correctly if Inductor provide non-contiguous inputs (due to padding etc.). The original issue is found for the backward op of weight_norm. The fix in this PR is a general one: we force inputs to all implicit fallback kernels to be contiguous.
I had to refactor the code a bit to make it work. Previously we applied layout constraints in `GraphLowering.run_node`, and we look for implicit fallbacks in `call_function`. The problem here is that when we set up the implicit fallback in `call_function` with a layout constraint, we don't have a chance to apply the constraints. The refactor moves the code that applies layout constraints to `call_function`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140996
Approved by: https://github.com/jansel
When the stub file `nn/parallel/distributed.pyi` was removed (#88701), some types that existed are no longer available. This pull request adds them back.
Just for reference, these types are used in pytorch-lightning's LightningCLI. Command line interfaces are created automatically, and having type hints makes them nicer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136835
Approved by: https://github.com/kwen2501
Summary:
Some graphs produced by the minifier graph cutter cannot be used for AOTI/export (illegal graphs). These should be considered as graphs that don't fail in the minifier, so that the minifier keeps searching.
One example is the following graph, where `true_graph_0` is an fx.GraphModule. Here, export.export() would give a `UserError` with `ErrorType = UserErrorType.INVALID_OUTPUT`.
```
# graph():
# %true_graph_0 : [num_users=1] = get_attr[target=true_graph_0]
# return (true_graph_0,)
```
This graph could be obtained from the module below:
```python
class M(torch.nn.Module):
def forward(self, x, flag):
flag = flag.item()
def true_fn(x):
return x.clone()
return torch.cond(flag > 0, true_fn, true_fn, [x])
```
So we detect such errors and exclude these graphs from the minifier's search (they are treated as graphs that didn't fail).
This is OK and won't miss any actual errors, since the AOTI minifier is only designed to catch errors in the AOTI phase anyway; it is not responsible for catching export bugs.
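The decision rule can be sketched like this. `InvalidOutputError`, `graph_reproduces_failure`, and the two callables are hypothetical stand-ins for the export step and the AOTI step; they are not the minifier's real functions.

```python
class InvalidOutputError(Exception):
    """Hypothetical stand-in for an export error such as INVALID_OUTPUT."""


def graph_reproduces_failure(graph, export_fn, compile_fn):
    """Return True only if the graph exports cleanly and then fails to compile."""
    try:
        exported = export_fn(graph)
    except InvalidOutputError:
        # Illegal graph: treat it as "did not fail" so the search moves on.
        return False
    try:
        compile_fn(exported)
    except Exception:
        return True  # a genuine failure in the phase being minified
    return False


def fake_export(graph):
    if graph == "returns_graph_module":
        raise InvalidOutputError("graph output is not a tensor")
    return graph


def fake_compile(exported):
    if exported == "buggy":
        raise RuntimeError("compile error")


results = {
    g: graph_reproduces_failure(g, fake_export, fake_compile)
    for g in ("returns_graph_module", "buggy", "fine")
}
```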
Test Plan:
```
buck2 run fbcode//caffe2/test/inductor:test_minifier_utils -- -r invalid_output
```
Differential Revision: D66143487
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140999
Approved by: https://github.com/henrylhtsang
Summary: We are working on onboarding legokit modules to ModuleStability and this is needed to fix the serialization issue found in P1680200613
Test Plan:
`buck2 test //torchrec/fb/legokit/module_stability_tests/layer_norm_stability_test:layer_norm_stability_test -- --env ADD_NEW_STABILITY_CONFIGS=True`
serialization succeeds when the above command is run on top of this diff.
Differential Revision: D66034492
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141047
Approved by: https://github.com/angelayi
Defining `static char shaderSource[]` in the header will instantiate it as often as it is included.
Solved the problem by renaming `static auto getCPLState(const std::string&)` to `auto getFusedAdamCPLState(const std::string&)` and instantiating it only once, which resulted in a 500K reduction in binary size (and perhaps even more in runtime footprint).
I.e. before
```
% ls -lak lib/libtorch_cpu.dylib
-rwxr-xr-x 1 malfet staff 183357744 Nov 19 17:58 lib/libtorch_cpu.dylib
```
and after
```
% ls -lak lib/libtorch_cpu.dylib
-rwxr-xr-x 1 malfet staff 183357120 Nov 19 17:57 lib/libtorch_cpu.dylib
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141090
Approved by: https://github.com/Skylion007
ghstack dependencies: #141089
### Background
This PR adds the functionality to xfail / skip on a per-`SampleInput` basis for `OpInfo` tests. See #89354 and #82669 for some requests asking for this type of functionality.
This was originally landed for NJT in #138370 and is generalized and slightly tweaked here.
### Design
#### Principles
* Clean separation among `SampleInput` generation logic, test logic that uses the `SampleInput`s, and xfail / skip logic (which will change as bugs are addressed).
* Flexibility in xfail / skip predicate specification - ideally each bug can be handled by a single skip / xfail, even if it surfaces across a specific class of ops.
* This is important in practice for NJT, where it's common to have a bug that affects all binary ops, for example.
* Opt-in with minimal test logic changes + no substantial impact on other tests.
#### Details
The core new concept is a `SampleRule`, which can be either an `XFailRule` or `SkipRule`.
```python
@dataclass
class SampleRule(ABC):
# function to indicate whether the rule applies to this op; return True if so
# NB: str arg of callable is device_type
op_match_fn: Callable[[str, OpInfo], bool] = None
# function to indicate whether the rule applies to this sample; return True if so
sample_match_fn: Callable[[torch.device, SampleInput], bool] = None
# optional name for identifying the rule
name: str = ""
@dataclass
class XFailRule(SampleRule):
# expected error type
error_type: TypeVar = Exception
# expected error message
error_msg: str = ".*"
@dataclass
class SkipRule(SampleRule):
...
```
* See below for example usage details, but at a high level: each test should have a corresponding list of `sample_skips_and_xfails`.
* The list of `sample_skips_and_xfails` is traversed in order, and the first rule that matches (if any) is applied, so order can matter.
* The PR includes a logging mechanism for matched rules accessible by setting the loglevel to `DEBUG`.
* The split between `op_match_fn` and `sample_match_fn` is made to allow pre-filtering of the list of rules to get only those that apply to the op under test.
* Each `SampleInput` is run within a subtest context so they can be individually skipped / xfailed as needed. This also means that a test will no longer stop after the first erroring `SampleInput`; all samples will be run through test logic.
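The two-stage matching described above (pre-filter by op, then first matching rule per sample) can be sketched in plain Python. The `Rule` class, the dict-based samples, and the helper names are simplified stand-ins, not the classes added in this PR.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Rule:
    name: str
    kind: str  # "skip" or "xfail"
    op_match_fn: Callable[[str], bool]
    sample_match_fn: Callable[[dict], bool]


def rules_for_op(rules: List[Rule], op_name: str) -> List[Rule]:
    """Pre-filter once per op so per-sample matching only sees relevant rules."""
    return [r for r in rules if r.op_match_fn(op_name)]


def first_matching_rule(rules: List[Rule], sample: dict) -> Optional[Rule]:
    """Rules are tried in order; the first match wins."""
    for rule in rules:
        if rule.sample_match_fn(sample):
            return rule
    return None


RULES = [
    Rule("add_2d_xfail", "xfail", lambda op: op == "add", lambda s: s["dim"] == 2),
    Rule("add_any_skip", "skip", lambda op: op == "add", lambda s: True),
    Rule("mul_skip", "skip", lambda op: op == "mul", lambda s: True),
]

add_rules = rules_for_op(RULES, "add")
print(first_matching_rule(add_rules, {"dim": 2}).name)  # add_2d_xfail
print(first_matching_rule(add_rules, {"dim": 3}).name)  # add_any_skip
```

Swapping the first two rules would make the skip shadow the xfail, which is why order can matter.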
### Example Usage
Consider the following OpInfo test:
```python
class MyTestCase(TestCase):
@ops(op_db)
def test_foo(self, device, dtype, op):
for sample in op.sample_inputs(device, dtype, requires_grad=False):
# do some SampleInput-based test logic
output = op.op(sample.input, *sample.args, **sample.kwargs)
...
```
This is a common pattern for such tests; simply generate a list of `SampleInputs` and run them through the op. Now say you want to xfail one of these `SampleInput`s for a given op. Today, you have to xfail the entire test or hack around this in the test logic.
This PR lets you do this to get very flexible xfail / skips based on op / sample input properties:
```python
# NB: Define rules for per-SampleInput xfails / skips. These can also be defined in-line in the @ops decorator, but
# it can be more readable to maintain these somewhere else. These are attempted to be matched in order and
# the first one that matches applies, so order can matter.
FOO_SKIPS_AND_XFAILS = [
XFailRule(
error_type=ValueError,
error_msg="2D inputs not supported",
op_match_fn=lambda device, op: (
# NB: logic for which ops this rule applies to goes here
op.full_name == "add"
),
sample_match_fn=lambda device, sample: (
# NB: logic which samples this rule applies to goes here
sample.input.dim() == 2
),
# NB: optional rule identifier can help with debugging matched rules
name="add_with_2D_inputs_not_supported",
),
# NB: This follows a similar structure as XFailRule but without error_type / error_msg. Obviously
# this skips a particular SampleInput instead of xfailing :)
SkipRule(...),
...
]
class MyTestCase(TestCase):
@ops(op_db)
@sample_skips_and_xfails(FOO_SKIPS_AND_XFAILS)
# NB: the @ops decorator automatically filters out any rules that don't apply to this op
def test_foo(self, device, dtype, op):
for sample, subtest_ctx in op.sample_inputs(
# NB: use_subtests=True is required for skips / xfails to work. If skips / xfails are defined and use_subtests != True,
# an informative error will be thrown.
device, dtype, requires_grad=False, use_subtests=True
):
# NB: this subtest context manager runs each sample input as a "subtest" and handles skips / xfails appropriately
with subtest_ctx(self):
# do some SampleInput-based test logic
output = op.op(sample.input, *sample.args, **sample.kwargs)
...
```
More examples can be seen in `test/test_nestedtensor.py`, where this system is used in practice.
I also demonstrate usage of syntactic sugar over this system in `test/functorch/test_vmap.py`. Here, a skip for the `to()` operator is replaced with a granular xfail for `test_vmap_exhaustive()`:
```python
...
# pre-existing xfail
xfail("item"),
# new granular xfail using syntactic sugar over the general system
xfailIf(
"to",
lambda sample: (
sample.kwargs["memory_format"] == torch.channels_last
),
),
...
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140443
Approved by: https://github.com/janeyx99, https://github.com/zou3519
ghstack dependencies: #140160, #138370
Adding some dynamo_timed annotations to better understand AOTI compilation time.
Probably would require a few more passes. A lot of time is spent in Scheduler.__init__, and not enough annotations are there.
run_command_and_check takes a lot of time as well, but there is probably not much we can do. Maybe we can add a config to tune the C++ optimization level?
traces:
<img width="1205" alt="Screenshot 2024-11-08 at 4 41 10 PM" src="https://github.com/user-attachments/assets/61645264-b3af-4d4a-804d-700b0f831c7c">
Differential Revision: D65554141
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140198
Approved by: https://github.com/desertfire
During export, we nub out most CIA ops to return NotImplemented to avoid decomposing them during tracing. To recover the existing shape propagation behavior, we register these CIA decomps directly as FakeTensorMode rules as well. We have to do this because when we return NotImplemented, FakeTensor falls back to running these CIAs with the Meta backend, causing device-branching CIA ops to fail (because the device is now Meta; one example is sdpa). If we register a kernel directly to FakeTensorMode, we won't fall back to the Meta backend.
Differential Revision: [D65716260](https://our.internmc.facebook.com/intern/diff/D65716260/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140465
Approved by: https://github.com/bdhirsh
Expands the `test_linalg_qr_autograd_errors` unit test to check all cases of differentiability/non-differentiability as given in the docs https://pytorch.org/docs/stable/generated/torch.linalg.qr.html:
- mode= ‘reduced’ (default): Returns (Q, R) of shapes (*, m, k), (*, k, n) respectively. It is always differentiable.
- mode= ‘complete’: Returns (Q, R) of shapes (*, m, m), (*, m, n) respectively. It is differentiable for m <= n.
- mode= ‘r’: Computes only the reduced R. Returns (Q, R) with Q empty and R of shape (*, k, n). It is never differentiable.
(in particular, the happy paths are added)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135097
Approved by: https://github.com/IvanYashchuk, https://github.com/nikitaved
On ROCm, hipification converts std::min to ::min, but ::min does not return the right result. This impacts the index_add_ operation on a large tensor: we end up picking the larger value instead of the max supported block size (128), which leads to the GPU accessing memory out of bounds.
While we wait for ::min to be fixed, we can use the < operator to compare instead of relying on ::min.
Example Code w/ failure:
```
D=6144
hidden_states = torch.zeros([16384, 6144], device="cuda:0", dtype=torch.bfloat16)
index = torch.randint(0, 16384, (1, 32, 16384), device="cuda:0", dtype=torch.int64)
output = torch.empty([1, 32, 16384, 6144], device="cuda:0", dtype=torch.bfloat16)
hidden_states.index_add_(0, index.view(-1), output.view(-1, D))
```
```
Traceback (most recent call last):
RuntimeError: HIP error: invalid configuration argument
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139087
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony
Summary:
1. Clearly specify in error messages that we are referring to a collective_sequence_id and an internal_record id for an entry.
The entry id is semi-useless for the end consumer, so at least let them know that this is an internal record id.
2. Add some missing fields in types.py.
self.missing_ranks = set()
self.input_numel = tuple()
self.output_numel = tuple()
self.errors = set()
These were showing up as linter errors when I opened the file in VS Code.
Test Plan:
```
buck2 run //caffe2/fb/flight_recorder:fr_trace -- -m f665492593-nerf_training-96ab95e0 -w 8 --mast_job_version 0 -a 0
Buck UI: https://www.internalfb.com/buck2/2cac9273-1b7b-47bf-867f-82f9a4c1d581
Network: Up: 0B Down: 0B
Not all ranks joining collective: sequence number: 31117
internal record id: 31116
group info: 0:default_pg
collective: nccl:all_reduce
missing ranks: {3, 4, 5, 6, 7}
input sizes: [[1571911]]
output sizes: [[1571911]]
world size: 8
expected ranks: {0, 1, 2, 3, 4, 5, 6, 7}
collective state: scheduled
collective stack trace:
all_reduce at /packages/fblearner.flow.canary/workflow#link-tree/torch/distributed/distributed_c10d.py:2707
wrapper at /packages/fblearner.flow.canary/workflow#link-tree/torch/distributed/c10d_logger.py:81
sync_buffers at /packages/fblearner.flow.canary/workflow#link-tree/xri_mapsr/neural_fields/models/gaussian_splatting.py:650
decorate_context at /packages/fblearner.flow.canary/workflow#link-tree/torch/utils/_contextlib.py:116
step at /packages/fblearner.flow.canary/workflow#link-tree/xri_mapsr/neural_fields/training/training_manager/splatting.py:356
main at /packages/fblearner.flow.canary/workflow#link-tree/xri_mapsr/neural_fields/nerf_training.py:260
main_impl at /packages/fblearner.flow.canary/workflow#link-tree/rl_aiep/mast/endpoint.py:57
main at /packages/fblearner.flow.canary/workflow#link-tree/rl_aiep/mast/endpoint.py:34
wrapper at /packages/fblearner.flow.canary/workflow#link-tree/torch/distributed/elastic/multiprocessing/errors/__init__.py:355
<module> at /packages/fblearner.flow.canary/workflow#link-tree/rl_aiep/mast/endpoint.py:118
_run_code at /packages/fblearner.flow.canary/workflow#link-tree/runtime/lib/python3.10/runpy.py:86
_run_module_as_main at /packages/fblearner.flow.canary/workflow#link-tree/runtime/lib/python3.10/runpy.py:196
run_as_main at /packages/fblearner.flow.canary/workflow#link-tree/__par__/bootstrap.py:69
run_as_main at /packages/fblearner.flow.canary/workflow#link-tree/__par__/meta_only/bootstrap.py:98
__invoke_main at /packages/fblearner.flow.canary/workflow#link-tree/__run_lpar_main__.py:28
<module> at /packages/fblearner.flow.canary/workflow#link-tree/__run_lpar_main__.py:31
...
```
Differential Revision: D66018461
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140969
Approved by: https://github.com/Skylion007, https://github.com/fduwjj
test test_save_load_transform in [test_transforms.py](https://github.com/pytorch/pytorch/blob/main/test/distributions/test_transforms.py)
_pytest test_transforms.py -k test_save_load_transform_
error message:
```
.
.
.
File "/workspace/pytorch/test/distributions/test_transforms.py", line 555, in test_save_load_transform
other = torch.load(stream)
^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/serialization.py", line 1444, in load
raise pickle.UnpicklingError(_get_wo_message(str(e))) from None
_pickle.UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options, do those steps only if you trust the source of the checkpoint.
(1) Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
(2) Alternatively, to load with `weights_only=True` please check the recommended steps in the following error message.
WeightsUnpickler error: Unsupported global: GLOBAL torch.distributions.transformed_distribution.TransformedDistribution was not an allowed global by default. Please use `torch.serialization.add_safe_globals([TransformedDistribution])` or the `torch.serialization.safe_globals([TransformedDistribution])` context manager to allowlist this global if you trust this class/function.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140494
Approved by: https://github.com/mikaylagawarecki
Citing @malfet's [comment](https://github.com/pytorch/pytorch/pull/136343#pullrequestreview-2318792396) in https://github.com/pytorch/pytorch/pull/136343
> It would be great, if users do not have to modify their programs for every new backend, but rather use with torch.device('xpu'): and keep rest of the code unchanged.
This PR makes the backend specification ("nccl", "gloo") optional when the user provides a `device_id` to `init_process_group` (passing `device_id` was already supported for the purpose of eager init).
New user experience:
```
device = torch.device(device_type, rank % device_count)
dist.init_process_group(device_id=device)
```
The `device = torch.device(...)` line is needed anyway because the user would use it for tensor creation, etc.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140963
Approved by: https://github.com/wconstab
Per discussion with @malfet, only allow weights_only unpickler to load NJT if `torch.nested` and `torch._dynamo` are imported
(this is slightly weird as technically `torch.nested` is actually imported by default and `torch._dynamo.decorators._DimRange` is actually what needs to be imported)
we can't import this from `torch.nested` as this would
- undo dynamo lazy import
- cause circular import
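One way to picture the gating described above is to consult `sys.modules`, so that a class is allowlisted only if its defining module has already been imported and nothing is imported as a side effect. The sketch below is illustrative only: `allowed_optional_globals` and its `loaded_modules` parameter are hypothetical names, and this is not the code in `torch/_weights_only_unpickler.py`.

```python
import sys
from typing import Iterable, Mapping, Optional, Set


def allowed_optional_globals(
    candidates: Mapping[str, Iterable[str]],
    loaded_modules: Optional[Mapping[str, object]] = None,
) -> Set[str]:
    """Return qualified names whose defining module is already imported.

    `candidates` maps a module name to attribute names defined in it.
    Nothing is imported here, so lazily imported modules stay lazy and
    no circular import can be triggered.
    """
    modules = sys.modules if loaded_modules is None else loaded_modules
    allowed: Set[str] = set()
    for module_name, attr_names in candidates.items():
        if module_name not in modules:
            continue  # module not imported yet: do not allowlist its classes
        for attr in attr_names:
            allowed.add(f"{module_name}.{attr}")
    return allowed


# Module and class names taken from the description above, used as strings only.
candidates = {
    "torch.nested._internal.nested_tensor": ["NestedTensor"],
    "torch._dynamo.decorators": ["_DimRange"],
}

# Simulated module tables (assumption: stand-ins for sys.modules).
only_nested = {"torch.nested._internal.nested_tensor": object()}
both = {**only_nested, "torch._dynamo.decorators": object()}

partial_result = allowed_optional_globals(candidates, only_nested)
full_result = allowed_optional_globals(candidates, both)
```

Passing the module table explicitly keeps the check testable without importing anything.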
===========================
Redo of https://github.com/pytorch/pytorch/pull/140304, which caused issues because `torch.nested._internal.foo` needs to be imported, producing errors like
```python
torch/_weights_only_unpickler.py", line 339, in load
if full_path in _get_allowed_globals():
torch/_weights_only_unpickler.py", line 188, in _get_allowed_globals
torch.nested._internal.nested_tensor.NestedTensor
AttributeError: module 'torch.nested' has no attribute '_internal'
```
**This likely wasn't caught in our CI because imports are global during unit tests(?), so we use subprocess to properly test this time**
Differential Revision: [D65961691](https://our.internmc.facebook.com/intern/diff/D65961691)
@jbschlosser
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140739
Approved by: https://github.com/malfet
Summary:
Fixes a bug in the quantizer where Conv + ReLU is fused even when the preceding Conv has more than one user. Conv and ReLU cannot be fused in this case because the result of Conv must be used elsewhere.
XNNPACK Delegate naturally handles this by inserting a clamp node for ReLU.
Test Plan: CI
Reviewed By: digantdesai
Differential Revision: D65989599
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140846
Approved by: https://github.com/digantdesai
Doc updates:
* This adds documentation for the object oriented ProcessGroup APIs that are being used in torchft as well as https://github.com/pytorch/rfcs/pull/71 .
* It also does some general cleanups to simplify the distributed.rst by using `:methods`.
* It adds `__init__` definitions for the Stores
* I've reordered things so the collective APIs are before the Store/PG apis
Test plan:
```
lintrunner -a
cd docs && sphinx-autobuild source build/ -j auto -WT --keep-going
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140853
Approved by: https://github.com/kwen2501
# Overview
Currently monitor.py produces error-only results, so this PR introduces a disable-monitor option for all *-test.yml files. We would also like to explore how the monitor code affects benchmark results.
# next steps
- fix the monitor.py
- enable non-benchmark tests with monitor
- investigate benchmark test behavior with monitor background job
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140857
Approved by: https://github.com/huydhn
Tracking issue: #138399
This PR fixes a number of reference implementations (which are also used as meta
functions), making them more consistent with CPU device. More specifically, it fixes those
operations that use `_make_elementwise_unary_reference` decorator, and don't error on
mismatching out argument dtype while they error when using concrete devices (e.g. CPU).
The fixed operations are:
- `abs`
- `ceil`
- `floor`
- `frac`
- `isneginf`
- `isposinf`
- `sgn`
- `sign`
- `signbit`
- `trunc`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140288
Approved by: https://github.com/ezyang
ghstack dependencies: #140186, #140286
Fixes #139320
### Summary:
#### (1) Add `_rename_dynamic_shapes_with_model_inputs` for dynamic_shapes to play along with input_names
* Use the model forward signature to rename dynamic_shapes when dynamic_shapes is not nested and directly uses the customized names. This solves the issue that torch.export.export expects dynamic_shapes to use only the model input names.
* If the dynamic_shapes is nested, we do nothing.
#### (2) Add `_from_dynamic_shapes_to_dynamic_axes` for fallback
* We flatten dynamic_shapes into its leaves with `_pytree.tree_leaves()`
~~* If a dynamic_shapes is not nested, and defined in dict. We can use the key as the input_names, since it should be renamed by `_rename_dynamic_shapes_with_model_inputs` already.~~
* If a dynamic_shapes is provided, input_names is required to assign the names, because dynamic_axes needs it.
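The two steps above can be sketched in plain Python for the simplest case, a flat (non-nested) dict. This is only an illustration under stated assumptions: `rename_to_forward_params` and `to_dynamic_axes` are hypothetical helpers, dimension names are plain strings rather than `torch.export.Dim` objects, and the flattening is a simple iteration instead of `_pytree.tree_leaves()`.

```python
import inspect
from typing import Any, Callable, Dict, List, Optional

DimSpec = Optional[Dict[int, str]]  # axis index -> dimension name, or None if static


def rename_to_forward_params(
    forward: Callable[..., Any],
    dynamic_shapes: Dict[str, DimSpec],
    input_names: List[str],
) -> Dict[str, DimSpec]:
    """Re-key a flat dynamic_shapes dict from custom input names to the
    parameter names of `forward`, matching the two lists by position."""
    param_names = list(inspect.signature(forward).parameters)
    custom_to_param = dict(zip(input_names, param_names))
    return {custom_to_param.get(key, key): spec for key, spec in dynamic_shapes.items()}


def to_dynamic_axes(
    dynamic_shapes: Dict[str, DimSpec],
    input_names: List[str],
) -> Dict[str, Dict[int, str]]:
    """Fallback conversion: pair each spec (in order) with an input name and
    keep only the inputs that have at least one dynamic axis."""
    axes: Dict[str, Dict[int, str]] = {}
    for name, spec in zip(input_names, dynamic_shapes.values()):
        if spec:
            axes[name] = dict(spec)
    return axes


def forward(x, y):  # stand-in for a model's forward signature
    return x, y


shapes = {"image": {0: "batch"}, "mask": None}
renamed = rename_to_forward_params(forward, shapes, ["image", "mask"])
dynamic_axes = to_dynamic_axes(renamed, ["image", "mask"])
```

The second helper shows why `input_names` is required on the fallback path: after flattening, the names are the only way to label each entry of `dynamic_axes`.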
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139532
Approved by: https://github.com/justinchuby
I.e. replace `at::detail::getMPSHooks().isOnMacOSorNewer` with `is_macos_13_or_newer`, which is a direct function call instead of going through a virtual method call
Hooks are only needed to provide a feature-agnostic interface to query something even on platforms that might not have support for the feature, while functions implemented in `ATen/native/xxx` should be able to call those platform-specific methods directly
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140950
Approved by: https://github.com/Skylion007
ghstack dependencies: #140896
Summary: When AOT_PARTITIONER_DEBUG is set to 1 and debug logging is turned on we can now log the full input and output for each knapsack problem.
Differential Revision: D65633086
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140757
Approved by: https://github.com/jansel
Which is a variadic template that automates the tedious (and error-prone) process of passing the arguments via a series of
```cpp
mtl_setBuffer(encoder, b1, 0);
mtl_setBuffer(encoder, b2, 1);
mtl_setBytes(encoder, param, 2);
```
into a compact
```
mtl_setArgs(encoder, b1, b2, param);
```
Introduce a few more specializations of `mps_setArg`, such as:
- Call `setBuffer` for `id<MTLBuffer>`
- Copy double as float (as MPS does not support double precision types)
- Accept `std::optional<at::Tensor>` that will not call `setBuffer` if the optional is empty
Also, re-metaprogram `mtl_setBytes` to make it usable with any trivially copyable struct, but keep a separate implementation for containers, as uploading `c10::SmallVector`, which is trivially copyable, would overwrite the next arguments, which luckily resulted in test failures of `test_cross_entropy_label_smoothing_weight_ignore_indices_mps`
Introduce `has_size_type_v`, which can be used to differentiate between trivially copyable `std::array` and `c10::ArrayRef` vs. other trivially copyable structs.
```cpp
template <typename T>
class has_size_type {
template <typename U>
static constexpr std::true_type check(typename U::size_type*);
template <typename>
static constexpr std::false_type check(...);
public:
static constexpr bool value = decltype(check<T>(nullptr))::value;
};
template <typename T>
constexpr bool has_size_type_v = has_size_type<T>::value;
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140896
Approved by: https://github.com/Skylion007
Fixes #140598
Allows ragged structures for query and key+value sequence lengths to differ (i.e. supports cross attention for Flex + NJT).
Technically, this is BC-breaking thanks to arg renaming and positional arg reordering in `create_nested_block_mask()`, but Flex + NJT support isn't in a major release yet so I'm hoping we can just do it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140723
Approved by: https://github.com/drisspg
# Motivation
This pr is an extension of #131758. As described in #131758, these changes are looking to make distributed UTs more accessible to users of all device types.
It is a demonstration of a few changes discussed by @kwen2501 and @jgong5 in the discussion for #131758(https://github.com/pytorch/pytorch/pull/131758#discussion_r1762422784)
This PR contains two types of changes. The first is to the common distributed folder, where we have added a new class derived from MultiProcessTestCase which helps abstract out process group creation/deletion and other functionality for a given device.
New generalized content can be added by deriving from this base class.
It also includes other misc changes for Gaudi support.
The second changed file is test_functional_api, a test file in common distributed. This file is a POC for how we can use this new class to write more device-agnostic distributed test cases.
The following changes have been made to test_functional_api.py:
- Functionality has been added to test for non-CUDA devices, using Intel HPU as an example
- Multiple setup steps previously required by MultiProcessTestCase have been abstracted out
- Misc adaptations to allow for general calls to accelerators while adding test skips instead of explicitly skipping for multiple GPUs
- skipIfHpu flags have been added to enable skipping a few multithreaded test cases which are not yet supported on HPUs
NOTE: Within test_functional_api, there are tests which require the use of some multithreading functions which are not yet supported on HPUs. These have been skipped for HPU using the skipHPU decorator.
I will be raising a separate PR to improve the usability of said decorators in a device-agnostic setting in the manner suggested by @kwen2501 in a comment on this PR.
This PR is a cleaned-up version of a previous PR (#136988) which I closed due to human error. I have addressed some of the comments made by @kwen2501 in this one as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138216
Approved by: https://github.com/kwen2501, https://github.com/guangyey
Here are some explanations of this PR.
1. Changes in `aten/src/ATen/core/Tensor.cpp` and `c10/core/DispatchKey.cpp`: Support the toString method for the `QuantizedPrivateUse1` backend, so that PyTorch prints the correct backend string for it.
2. Add header `DispatchStub.h` in `aten/src/ATen/native/quantized/IndexKernel.h`: If this header is not included, we can't use `masked_fill_kernel_quantized_stub` even if we include the `IndexKernel.h` header; compilation fails with an error.
3. Add multiple `TORCH_API`s in `aten/src/ATen/native/quantized/AffineQuantizer.h`: these functions are useful for other privateuse1 backends that support quantization functions. If these `TORCH_API`s are missing, an undefined symbol error is thrown at runtime.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139860
Approved by: https://github.com/bdhirsh
Faced with an annoying string of warnings like this when running tests,
<img width="1644" alt="Screenshot 2024-11-15 at 11 23 21 AM" src="https://github.com/user-attachments/assets/91ff4e1d-3c29-4510-9a61-46e7df68a212">
My choices seem to be (1) call destroy_process_group() at the end of
each test fn, (2) do this in some wrapper, (3) do it in the base test
class.
Since tests in MultiProcessTestCase are responsible for calling
init_process_group themselves, they should also be responsible for
calling destroy (or at least method (3) would be asymmetric and may
result in double-destroy).
But it doesn't feel worth it to go add a destroy call manually to each
test, and try/except for a possible second destroy call seems like a
happy middle ground.
Note: tests that want to ensure that destroy runs cleanly can and should
still call destroy _inside_ the test, and this change does not affect
that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140820
Approved by: https://github.com/fegin
Summary:
We observed another corner case where not all split items are used, see the screenshot
{F1960315622}
We thus skip such cases by checking the getitem indices.
Test Plan:
# local reproduce
```
buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --flow_id 663157369 2>&1 | tee ~/cmf.txt
```
P1679677122
# E2E
before fix
f663157369
after fix
Differential Revision: D65990213
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140788
Approved by: https://github.com/jackiexu1992
Also add missing mypy typing and a few asserts to make mypy happy
Partially addresses RFC 0042 (pytorch/rfcs#71)
See more details/motivation in #140460
Note: the object collective version canonicalizes to global instead of group
rank, simply because this left more of the original code intact and
required fewer conversions overall.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140827
Approved by: https://github.com/kwen2501
Previously `SymmetricMemory` only had private pybind APIs:
```python
from torch.distributed._symmetric_memory import _SymmetricMemory
t = _SymmetricMemory.empty_strided_p2p(
size=(64,),
stride=(1,),
dtype=torch.float32,
device=device,
)
symm_mem_hdl = _SymmetricMemory.rendezvous(t, group_name=group.group_name)
```
This PR introduces user-facing APIs empty() and rendezvous():
```python
import torch.distributed._symmetric_memory as symm_mem
t = symm_mem.empty(64, device="cuda")
symm_mem_hdl = symm_mem.rendezvous(t, group_name=group.group_name)
```
Notable differences compared to the pybind APIs:
- `empty()` now resembles `torch.empty()`:
- shape can either be an integer sequence or pack
- no need to/can't specify stride anymore
- device can either be `torch.device` or string
- `group_name` needs to be specified at rendezvous time as opposed to allocation time. See https://github.com/pytorch/pytorch/pull/139529 for the rationales. I feel the new semantic is superior, hence enforcing it in the public API.
- Currently, the pybind API still supports specifying `group_name` at rendezvous time.
This PR does not change the behavior of the pybind APIs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139677
Approved by: https://github.com/lw
ghstack dependencies: #139529
I.e. fixes
```
[1082/1084] Building OBJCXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/mps/operations/UpSample.mm.o
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/UpSample.mm:224:10: warning: non-portable path to file '<ATen/native/mps/UpSample_metallib.h>'; specified path differs in case from file name on disk [-Wnonportable-include-path]
224 | #include <ATen/native/mps/Upsample_metallib.h>
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| <ATen/native/mps/UpSample_metallib.h>
```
as generated header name should have the same capitalization as respective shader file, i.e. `kernels/UpSample.metal`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140891
Approved by: https://github.com/Skylion007
Before this PR, users need to call `empty_strided_p2p()` with a `group_name`:
```python
tensor = _SymmetricMemory.empty_strided_p2p((1024,), (1,), device=device, group_name="0")
symm_mem = _SymmetricMemory.rendezvous(tensor)
```
Users can now omit `group_name` at allocation time and specify it later at rendezvous time:
```python
tensor = _SymmetricMemory.empty_strided_p2p((1024,), (1,), device=device)
symm_mem = _SymmetricMemory.rendezvous(tensor, group_name="0")
```
Rationales for this change:
- This allows the same allocation to establish symmetric memory under different groups
- Specifying `group_name` at rendezvous time instead of allocation time is a more natural UX
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139529
Approved by: https://github.com/lw
Faced with an annoying string of warnings like this when running tests,
<img width="1644" alt="Screenshot 2024-11-15 at 11 23 21 AM" src="https://github.com/user-attachments/assets/91ff4e1d-3c29-4510-9a61-46e7df68a212">
My choices seem to be (1) call destroy_process_group() at the end of
each test fn, (2) do this in some wrapper, (3) do it in the base test
class.
Since tests in MultiProcessTestCase are responsible for calling
init_process_group themselves, they should also be responsible for
calling destroy (or at least method (3) would be asymmetric and may
result in double-destroy).
But it doesn't feel worth it to go add a destroy call manually to each
test, and try/except for a possible second destroy call seems like a
happy middle ground.
Note: tests that want to ensure that destroy runs cleanly can and should
still call destroy _inside_ the test, and this change does not affect
that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140820
Approved by: https://github.com/fegin
ghstack dependencies: #140460, #140815
Avoid copypaste of send/isend and recv/irecv impl.
This does change the warning issued from send to include the identifier
"isend" instead of "send", but I think that's not a big deal.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140815
Approved by: https://github.com/fegin
ghstack dependencies: #140460
This PR adds caching for user defined triton kernels by putting the transitive closure of source code in node.meta along with constant arguments.
One HUGE hack we do here comes from the fact that a node looks like
```
triton_kernel_wrapper_functional_proxy = torch.ops.higher_order.triton_kernel_wrapper_functional(kernel_idx = 0, constant_args_idx = 1, grid = [(1, 1, 1)], tma_descriptor_metadata = {}, kwargs = {'in_ptr0': arg0_1, 'in_ptr1': arg1_1, 'out_ptr': arg0_1}, tensors_to_clone = ['out_ptr']);
```
so we use regex to remove `kernel_idx = 0, constant_args_idx = 1` parts as they are not relevant to cache hash. This is horrible and I'd like to eventually not use pickle as a hashing alternative but this is a longer project.
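A minimal sketch of that regex step, assuming the node is available as a string like the one above. The pattern and the helper name `strip_kernel_indices` are my own illustration, not necessarily what the PR uses.

```python
import re

# Matches the two index arguments, plus the trailing comma and spaces if present.
_INDEX_ARGS = re.compile(r"kernel_idx = \d+, constant_args_idx = \d+,?\s*")


def strip_kernel_indices(node_repr: str) -> str:
    """Drop kernel_idx / constant_args_idx so they do not affect the cache key."""
    return _INDEX_ARGS.sub("", node_repr)


first = (
    "torch.ops.higher_order.triton_kernel_wrapper_functional("
    "kernel_idx = 0, constant_args_idx = 1, grid = [(1, 1, 1)])"
)
second = (
    "torch.ops.higher_order.triton_kernel_wrapper_functional("
    "kernel_idx = 3, constant_args_idx = 7, grid = [(1, 1, 1)])"
)

normalized_first = strip_kernel_indices(first)
normalized_second = strip_kernel_indices(second)
```

Two nodes that differ only in their table indices normalize to the same string, so they can share a cache entry.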
Differential Revision: [D65895744](https://our.internmc.facebook.com/intern/diff/D65895744)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140326
Approved by: https://github.com/zou3519
Summary:
Add a wait counter for the dump function.
This is useful to see if we get stuck in the dump function and never return for a particular job.
Test Plan: Tested locally and I see `pytorch.wait_counter.NCCLTraceBuffer__dump.busy_time_us.sum.60` in ODS.
Differential Revision: D65823433
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140823
Approved by: https://github.com/fduwjj
Summary:
Customize splitter behavior to mark `get_attr` nodes as acc supported.
Currently these nodes are excluded by `FxNetAccNodesFinder` which marks all nodes with op not in `CALLABLE_NODE_OPS` ("call_module", "call_function", "call_method") as unsupported.
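The filtering rule can be illustrated with a simplified stand-in. `Node` and `find_acc_nodes` below are hypothetical, and the real `FxNetAccNodesFinder` also consults operator support, which this sketch ignores.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List

CALLABLE_NODE_OPS = frozenset({"call_module", "call_function", "call_method"})


@dataclass
class Node:
    """Hypothetical minimal graph node: just a name and an op kind."""

    name: str
    op: str


def find_acc_nodes(
    nodes: Iterable[Node],
    extra_supported_ops: FrozenSet[str] = frozenset(),
) -> List[str]:
    """Return names of nodes whose op kind is considered acc supported."""
    supported_ops = CALLABLE_NODE_OPS | extra_supported_ops
    return [node.name for node in nodes if node.op in supported_ops]


graph = [
    Node("weight", "get_attr"),
    Node("linear", "call_function"),
    Node("output", "output"),
]

default_result = find_acc_nodes(graph)
customized_result = find_acc_nodes(graph, frozenset({"get_attr"}))
```

With the default rule the `get_attr` node is left out of the acc set; with the customization it is kept alongside the node that consumes it.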
Before this change, merge-net is split into an almost empty cpu submodule with a single empty output node:
```
INFO:caffe2.torch.fb.model_transform.experimental.prepare_fx_model:###### debug_print nodes for _run_on_cpu_0
INFO:caffe2.torch.fb.model_transform.experimental.prepare_fx_model:Found output node: n.name='output', n.target='output', n.args=((),), n.kwargs={}, n.meta={}
INFO:caffe2.torch.fb.model_transform.experimental.prepare_fx_model:return ()
INFO:caffe2.torch.fb.model_transform.experimental.prepare_fx_model:
_run_on_cpu_0 stats for merge:
[output] output: 1
```
full log: P1678727348 (generated using same command as below)
Test Plan:
Tested by lowering `ig_organic_feed_cn_v2_mtml` using cmd:
```
buck run mode/opt-split-dwarf //tgif/cli:cli -- --model-name=ig_organic_feed_cn_v2_mtml --model-type ig_organic_feed_cn_v2_mtml --world-size=1 --storage-mode 1 --inference-dtype=FP16 --meta-transform=False --use-random-weights=True --accelerator-arch=3 --enable-input-dist=True --embedding-tables-dtype=FP16 --mtia-use-torch-export=True embedding-quantization-pass torchrec-sharding-pass tgif-split-pass gen-app-graph-pass tgif-mtia-lowering-pass dense-quantization-pass save-torch-package-pass generate-model-package-pass pack-weights-and-save-pass 2>&1 | tee /tmp/publish_ig_organic_feed_cn_v2_mtml_mtia_export_20241114_splitter_2.log
```
Output shows only 1 acc submodule is generated for merge:
```
INFO 18:33:15.951 1735650 utils.py:235: [TGIF] num of acc submodules: 1
INFO 18:33:15.952 1735650 utils.py:236: [TGIF] num of cpu submodules: 0
INFO 18:33:16.534 1735650 logging_utils.py:53: [TGIF] _run_on_acc_0 graph module debug info: https://www.internalfb.com/intern/everpaste/?color=0&handle=GK4VKhWsDKF9VdsDAKxhR6KAlhJ0br0LAAAz
INFO 18:33:16.534 1735650 utils.py:257: [TGIF] Start MTIA lowering _run_on_acc_0 in merge, device ordinal: -1
```
full log: P1679596796
Differential Revision: D65983916
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140794
Approved by: https://github.com/ezyang
Summary:
### Motivation
In D65283170, we need a subclass of the quantizable LSTM to enable split_gates. It is also required for tests.
### What's the change?
As the subclass is not part of the no_observer() set, an improper observer is added after the quantizable LSTM module. Here, we change the class check to an issubclass check against the no_observer set.
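The difference between the two checks can be seen in a small standalone sketch. The class names and the `NO_OBSERVER` set below are hypothetical stand-ins, not the actual quantization classes.

```python
class QuantizableLSTM:
    """Hypothetical stand-in for the quantizable LSTM module."""


class SplitGatesLSTM(QuantizableLSTM):
    """Hypothetical subclass, e.g. one that enables split_gates."""


NO_OBSERVER = {QuantizableLSTM}  # stand-in for the no_observer set


def skip_observer_exact(module: object) -> bool:
    # Old behavior: only exact class membership counts.
    return type(module) in NO_OBSERVER


def skip_observer_subclass(module: object) -> bool:
    # New behavior: subclasses of listed classes count as well.
    return issubclass(type(module), tuple(NO_OBSERVER))


sub = SplitGatesLSTM()
exact_says_skip = skip_observer_exact(sub)
subclass_says_skip = skip_observer_subclass(sub)
```

Under the exact check the subclass would receive an observer, while the `issubclass` check treats it like its parent.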
Test Plan:
- N6206576
- CI.
Reviewed By: andrewor14
Differential Revision: D65989314
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140818
Approved by: https://github.com/andrewor14
Implementation of the `softmax_backward_data` operator for the CPU backend produces incorrect results when the `output` argument is non-contiguous.
Here is a test case that demonstrates this issue:
```python
torch.manual_seed(0)
op = torch.ops.aten._softmax_backward_data
grad_output = torch.ones(3, 3, 3)
temp = torch.randn(3, 10, 3)
out = temp[:, :3, :]
out = out.contiguous()
print(out.is_contiguous())
grad_input = op(grad_output, out, 1, torch.float32)
print(grad_input)
```
In this test case, the variable `grad_input` yields incorrect results if the line `out = out.contiguous()` is commented out. With this fix, `grad_input` consistently produces the same results whether or not `output` is contiguous.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139740
Approved by: https://github.com/zou3519
Now that all cells are modeled as `NewCellVariable` in Dynamo, we no
longer need to put cell variables into this special `closure_cells`,
rather we just merge `closure_cells` with `symbolic_locals`.
This allows us to merge and remove some code paths, notably make
`LOAD_CLOSURE` the same as `LOAD_FAST`, and `LOAD_DEREF` & `STORE_DEREF`
the same for inlining or regular `InstructionTranslator`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140154
Approved by: https://github.com/jansel
ghstack dependencies: #140330, #140152, #140436, #140435, #140153
In addition to `NewCellVariable`, Dynamo has 3 ways of modeling cell objects:
1. For cells captured and created by the root frame, represent them as
their contents in `root_tx.symbolic_locals`, which `LOAD_DEREF` and
`STORE_DEREF` update directly, without going through `SideEffects`.
2. `ClosureVariable`: this is created when cells from (1) are captured
by a newly created function Dynamo is about to inline. It's a handle
with a name that redirects `LOAD_DEREF` and `STORE_DEREF` back to (1),
to keep `root_tx.symbolic_locals` up-to-date.
3. For cells that are captured by both the root frame and some
pre-existing function Dynamo is about to inline, represent those
cells as contents, and do not allow writes to them.
Note that (2) and (3) are mainly to conform with (1) -- to make sure
Dynamo has a consistent modeling of cells for the same cell objects.
In this patch, we represent all of these cells as `NewCellVariable`. The
main new code paths introduced are:
- using `NewCellVariable` to model cell objects created by the root
frame (the cells are passed in as input to `InstructionTranslator`),
this is what allows us to get rid of all 3 legacy paths above.
- adding a new `AutoDerefLocalSource` to deal with the python-code
level (guards) and bytecode level (codegen) auto-dereferencing
behavior, when accessing pre-existing python cells. This also
involves a tiny update to guard manager generation.
- plumbing some extra info into `LocalSource` and `CellVariable` so that
we can still emit `LOAD_DEREF`, `STORE_DEREF`, `LOAD_CLOSURE` (instead
of `make_cell`, `cell_contents` attribute access, and `LOAD_FAST`),
which is important for readability, performance, and some
assumptions `bytecode_transformation.py` makes.
As a result, this patch removes a lot of the now-dead code paths and
TODOs. Notably, it significantly simplified the `prune_dead_locals`
function, which was duplicating a lot of the logic from
`prune_dead_object_new`; this conveniently closes #137123.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140153
Approved by: https://github.com/jansel
ghstack dependencies: #140330, #140152, #140436, #140435
In `match_nested_cell`, Dynamo tried to identify pre-existing captured
cells by `(cell_name, id(cell_contents))`. This works in most cases, but
as the test added in this patch shows, it's not a complete solution.
This patch
1. changes `match_nested_cell` to `lookup_variable_for_captured_cell`,
and does the lookup based on id of cell objects, not their contents.
This requires plumbing a tuple of captured cell objects from
different CPython versions all the way to
`InstructionTranslator.__init__`, where we store a mapping from the
ids of these cell objects, and use it later in
`UserFunctionVariable.bind_args` to look for these unboxed cells.
2. builds off (1) -- rather than using a `VariableTracker` that
represents the content of the unboxed cells, use `ClosureVariable`,
which enables codegen in case these cells escape as closure of a
`NestedUserFunctionVariable`.
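A plain CPython illustration of why `(cell_name, id(cell_contents))` is not a complete key, independent of Dynamo: two distinct cells can share both a variable name and a contents object. The helper names below are made up for this sketch.

```python
SHARED = object()  # one object that both cells will hold


def make_getter():
    x = SHARED

    def get():
        return x  # captures `x`, so `get` owns a closure cell named "x"

    return get


first, second = make_getter(), make_getter()
cell_a = first.__closure__[0]
cell_b = second.__closure__[0]


def key_by_contents(name, cell):
    # Old-style key: variable name plus identity of the cell's contents.
    return (name, id(cell.cell_contents))


def key_by_cell(cell):
    # New-style key: identity of the cell object itself.
    return id(cell)


same_by_contents = key_by_contents("x", cell_a) == key_by_contents("x", cell_b)
same_by_cell = key_by_cell(cell_a) == key_by_cell(cell_b)
```

The contents-based key conflates the two cells, while keying on the cell object keeps them apart.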
The patch adds a regression test for each of the scenarios above:
1. `test_write_to_cells_with_name_shadowing` where Dynamo mistakenly
thought the program is writing to a cell captured by root frame (which
it doesn't support atm), which resulted in
```
File "/Users/ryanguo99/Documents/work/pytorch/torch/_dynamo/symbolic_convert.py", line 3340, in STORE_DEREF
unimplemented("write to __closure__ while inlining")
File "/Users/ryanguo99/Documents/work/pytorch/torch/_dynamo/exc.py", line 313, in unimplemented
raise Unsupported(msg, case_name=case_name)
torch._dynamo.exc.Unsupported: write to __closure__ while inlining
```
2. `test_existing_func_that_creates_capturing_nested_func` where Dynamo
ended up trying to codegen a `NestedUserFunctionVariable` that
captures a cell which was also captured by the root frame, so it was
unboxed and ends up emitting `LOAD_DEREF` rather than
`LOAD_FAST/LOAD_CLOSURE` during codegen, resulting in
```
File "/Users/ryanguo99/Documents/work/pytorch/torch/_dynamo/variables/functions.py", line 105, in _create_nested_fn
func = FunctionType(code, f_globals, name, defaults, closure)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: arg 5 (closure) expected cell, found int
```
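To see why keying on `(cell_name, id(cell_contents))` can be ambiguous, here is a minimal plain-Python sketch (not Dynamo code; `make_getter` and the variable names are hypothetical). Two distinct cells share both their name and the identity of their contents, so only the id of the cell object itself tells them apart:

```python
def make_getter(value):
    # Each call creates a fresh cell named "value" for the closure below.
    def get():
        return value
    return get


shared = object()
f = make_getter(shared)
g = make_getter(shared)

cell_f = f.__closure__[0]
cell_g = g.__closure__[0]

# Key built from the cell name and the id of its contents: identical for both.
key_by_contents_f = (f.__code__.co_freevars[0], id(cell_f.cell_contents))
key_by_contents_g = (g.__code__.co_freevars[0], id(cell_g.cell_contents))

# Key built from the id of the cell object itself: distinct.
key_by_cell_f = id(cell_f)
key_by_cell_g = id(cell_g)
```

Here `key_by_contents_f == key_by_contents_g` even though `cell_f is not cell_g`, which is the kind of collision a lookup by cell id avoids.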
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140436
Approved by: https://github.com/jansel, https://github.com/williamwen42
ghstack dependencies: #140330, #140152
This patch introduces a `DynamoFrameType` to serve as a layer between
Dynamo and different versions of Python frame object. In
`DynamoFrameType`, we only register attributes Dynamo cares about (e.g.,
`f_code`, `f_locals`, etc.).
This will be helpful when it comes to adding new attributes to this
`DynamoFrameType`, or dealing with Python version changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140330
Approved by: https://github.com/jansel, https://github.com/williamwen42
Context: we are now (sometimes; it's an edge case) trying to pass an empty tensor through the system, and it seems to cause all_reduce to segfault, which is unexpected to me.
Deep Shah and Pavan identified the issue, I'm just pushing for a fix :)
Test Plan: idk what i'm doing here, someone help
Reviewed By: shuqiangzhang
Differential Revision: D65956095
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140741
Approved by: https://github.com/shuqiangzhang
Features:
(1) Add support for tree structure.
(2) Add user warning before axes to shapes conversion
(3) Add suggestion of providing `dynamic_shapes` when conversion fails
Notes:
(1) `input_names` is crucial to the conversion, as we don't know the ONNX graph inputs.
(2) min and max are set to their defaults, so LLMs have a higher chance of failing when users pass `dynamic_axes`, because of min/max constraint dependencies between `attention_mask` and `sequence_length`, etc. (Found in llama-3.2-1B_Instruct)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140488
Approved by: https://github.com/justinchuby
Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
Summary:
If unhealthy, the user should be able to get the type of error, e.g.,
timeout, NCCL error, or remote error.
This API is applied at the PG level, compared to the work.get_future_result() API, which is applied at the Work level.
Error detection at the PG level is much more convenient for users who want to handle the PG failure as a whole, e.g., by restarting the PG.
Error handling at the work level is still useful for users to attach work-specific context and debug the RC of the specific failing work/collective.
Note it is critical for all ranks in the PG to be notified about an error as soon as it occurs, so we introduce an errorType of REMOTE_ERROR, which is 'broadcasted' from a src rank (which detects a local error) to all other ranks in the PG. The broadcast is currently done through TCPStore.
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140087
Approved by: https://github.com/kwen2501
Functionally the two decorators are very similar, but one should rely on expectedFailure as much as possible to get a signal when something is fixed.
- Move `product_version` variable from `test_mps` to common_utils, but call it `MACOS_VERSION`
- Introduce `skipIfMPSOnMacOS13` to decorate the hard crashes that happen only on MacOS13 (which at this point will not get any fixes and will be deprecated soon)
- Add `device_type='mps'` to all `skipIfMPS` per https://github.com/pytorch/pytorch/issues/140560
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139940
Approved by: https://github.com/janeyx99, https://github.com/huydhn
Fixes following warnings:
```
In file included from /Users/malfet/git/pytorch/pytorch/torch/csrc/Generator.cpp:25:
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/mps/MPSGeneratorImpl.h:40:63: warning: extra ';' after member function definition [-Wextra-semi]
40 | void set_engine(at::Philox4_32 engine) { engine_ = engine; };
| ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/mps/MPSGeneratorImpl.h:41:46: warning: extra ';' after member function definition [-Wextra-semi]
41 | at::Philox4_32 engine() { return engine_; };
| ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/mps/MPSGeneratorImpl.h:43:62: warning: extra ';' after member function definition [-Wextra-semi]
43 | static DeviceType device_type() { return DeviceType::MPS; };
| ^
3 warnings generated.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140776
Approved by: https://github.com/Skylion007
Currently, we get all partition IDs by iterating over assignment, whose size equals the number of nodes in the graph. But we can reach the same result by iterating over partitions_by_id, whose size is much smaller than the number of nodes. If the number of nodes is N and the number of partitions is P, the time complexity decreases from O(N * N) to O(N * P) after this patch.
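As a toy illustration of the idea (the dictionary shapes below are assumptions for the sketch, not the partitioner's actual data structures), the same set of partition IDs can be read from either mapping, but one has an entry per node and the other an entry per partition:

```python
# Toy data: node name -> partition id, and partition id -> set of node names.
assignment = {"n0": 0, "n1": 0, "n2": 1, "n3": 1, "n4": 1}
partitions_by_id = {0: {"n0", "n1"}, 1: {"n2", "n3", "n4"}}


def partition_ids_via_assignment(assignment):
    # Visits one entry per node: N steps.
    return sorted(set(assignment.values()))


def partition_ids_via_partitions(partitions_by_id):
    # Visits one entry per partition: P steps, with P <= N.
    return sorted(partitions_by_id)
```

When this lookup sits inside a per-node loop, replacing the N-step version with the P-step version is what turns O(N * N) into O(N * P).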
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136598
Approved by: https://github.com/mcr229
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
When `TORCH_SYMM_MEM_ALLOW_OVERLAPPING_DEVICES` is set, the check for overlapping devices and multicast support will be disabled. This is useful for testing with a single device.
Making this an env var instead of an API argument since this is likely only useful for testing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140127
Approved by: https://github.com/lw
Summary: The gm_torch_level can be a _LazyGraphModule(GraphModule) instead of a GraphModule. When we call .recompile(), GraphModule populates the self._out_spec, but _LazyGraphModule(GraphModule).recompile() doesn't populate it.
Test Plan: CI
Differential Revision: D65902135
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140608
Approved by: https://github.com/tugsbayasgalan
This reintroduces support for high channel sizes for convs. The guard for macOS versions < 15.1 is still present to prevent reintroducing #129207.
I'm unsure about the specific macOS version support, but I'm assuming this was fixed in 15.1, and I'm relying on signals from ci for verification. I'm expecting the new test will fail for macOS versions < 15.1, and the old test will start failing for > 15.0. I've added xfails for this and extended the version helpers to support 15.1+.
Fixes #140722
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140726
Approved by: https://github.com/malfet
@ezyang noticed this exercises a multithreading bug that is causing tests to become disabled:
```
2024-11-13T21:05:55.8363582Z inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCPU::test_comprehensive_fft_ihfftn_cpu_int32 /opt/conda/envs/py_3.9/lib/python3.9/site-packages/_pytest/threadexception.py:73: PytestUnhandledThreadExceptionWarning: Exception in thread Thread-3
2024-11-13T21:05:55.8364857Z
2024-11-13T21:05:55.8364974Z Traceback (most recent call last):
2024-11-13T21:05:55.8365491Z File "/opt/conda/envs/py_3.9/lib/python3.9/threading.py", line 980, in _bootstrap_inner
2024-11-13T21:05:55.8366003Z self.run()
2024-11-13T21:05:55.8366371Z File "/opt/conda/envs/py_3.9/lib/python3.9/threading.py", line 917, in run
2024-11-13T21:05:55.8366858Z self._target(*self._args, **self._kwargs)
2024-11-13T21:05:55.8367518Z File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/fbscribelogger/__init__.py", line 176, in _run_event_loop
2024-11-13T21:05:55.8368189Z self.loop.run_until_complete(self.task)
2024-11-13T21:05:55.8368774Z File "/opt/conda/envs/py_3.9/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
2024-11-13T21:05:55.8369348Z return future.result()
2024-11-13T21:05:55.8369980Z File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/fbscribelogger/__init__.py", line 214, in _worker
2024-11-13T21:05:55.8370603Z message = await asyncio.wait_for(
2024-11-13T21:05:55.8371090Z File "/opt/conda/envs/py_3.9/lib/python3.9/asyncio/tasks.py", line 442, in wait_for
2024-11-13T21:05:55.8371573Z return await fut
2024-11-13T21:05:55.8372156Z File "/opt/conda/envs/py_3.9/lib/python3.9/asyncio/queues.py", line 166, in get
2024-11-13T21:05:55.8372613Z await getter
2024-11-13T21:05:55.8374010Z RuntimeError: Task <Task pending name='Task-1' coro=<FbScribeLogger._worker() running at /opt/conda/envs/py_3.9/lib/python3.9/site-packages/fbscribelogger/__init__.py:214> cb=[_run_until_complete_cb() at /opt/conda/envs/py_3.9/lib/python3.9/asyncio/base_events.py:184]> got Future <Future pending> attached to a different loop
2024-11-13T21:05:55.8375366Z
2024-11-13T21:05:55.8375603Z warnings.warn(pytest.PytestUnhandledThreadExceptionWarning(msg))
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140717
Approved by: https://github.com/ezyang, https://github.com/zxiiro
Summary: https://github.com/pytorch/pytorch/pull/136505 changed the cache_clear operation to remove loaded modules from disk. That change caused some problems with TORCHINDUCTOR_FORCE_DISABLE_CACHES=1, where there are some code paths (coordinate descent tuning at least), where we call `PyCodeCache.load_by_key_path` and expect that the files are still on disk. (But when caches are disabled, we call cache_clear before every inductor compile). It seems we probably have a shortcoming in the disable-cache logic, but since we also have flakey test failures with the same `'could not get source code'` error, let's restore the previous functionality until I can investigate further.
Since some tests actually _DO_ want to delete on-disk artifacts (e.g., to test remote caching), I added a `purge` param to optionally delete files.
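The intended behavior can be sketched with a toy cache. Everything here (`ToyCodeCache`, `write`, `load_from_disk`) is hypothetical and only mirrors the description above: clearing drops in-memory state by default and deletes files only when `purge=True`.

```python
import os


class ToyCodeCache:
    """Hypothetical stand-in for an on-disk code cache; not the Inductor API."""

    def __init__(self):
        self.loaded = {}  # path -> source text currently held in memory

    def write(self, path, source):
        with open(path, "w") as f:
            f.write(source)
        self.loaded[path] = source

    def load_from_disk(self, path):
        # Callers may reload after a clear, so the file must still exist.
        with open(path) as f:
            self.loaded[path] = f.read()
        return self.loaded[path]

    def cache_clear(self, purge=False):
        if purge:
            # Only delete on-disk artifacts when explicitly requested.
            for path in self.loaded:
                if os.path.exists(path):
                    os.remove(path)
        self.loaded.clear()
```

With the default `purge=False`, a later `load_from_disk` still succeeds, which matches the code paths described above that expect the files to remain on disk.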
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140216
Approved by: https://github.com/eellison
Here's the overview:
There's a new contextmanager singleton called MetricsContext. Entering the MetricsContext is how we demarcate the boundary on which we'll create a single CompilationMetrics object, and therefore, a single dynamo_compile log entry. While we're inside the MetricsContext, we can update/set many different metrics. Most importantly: `dynamo_timed` can also update the in-progress MetricsContext. In the proposal here, we tell `dynamo_timed` that we want it to do so by providing the name of the MetricsContext field to increment. There can be many `dynamo_timed` calls in different parts of the code updating different fields. Then when the MetricsContext exits, that's when the logging of everything gathered finally happens. One potential footgun is trying to use `dynamo_timed` when we haven't entered the MetricsContext, but we assert on that problem. Another problem is that we re-enter the context recursively, but we watch for that and do the logging only when the outermost exits.
Some specifics:
* Introduce MetricsContext - a context manager that on exit, records the CompilationMetrics (which also logs to dynamo_compile).
* Completely remove the concept of frame_phase_timing. Instead, update the MetricsContext during compilation, either directly or via dynamo_timed.
* Remove some globals we previously used to accumulate counters to later populate a CompilationMetrics. We use CompilationMetrics set/update/increment APIs instead.
* `record_compilation_metrics` is now called on exit from MetricsContext.
* Populate legacy CompilationMetrics fields right before logging, inside `record_compilation_metrics`.
* Remove the one-off `add_remote_cache_time_saved` helper; capture that timing directly into the MetricsContext.
And specifically, several changes to dynamo_timed:
* "Modernize" the parameters and update all callsites accordingly.
* Move the backwards logging of the CompilationMetrics to the backwards compile location.
* Add a parameter for which CompilationMetrics field to update
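The control flow described above can be sketched as follows. This is a toy model under stated assumptions, not the actual implementation: `ToyMetricsContext`, `toy_timed`, and the field name used in the usage note are hypothetical, and the sketch raises `RuntimeError` where the real code is described as asserting.

```python
import time
from contextlib import contextmanager


class ToyMetricsContext:
    """Re-entrant context; the callback fires only when the outermost level exits."""

    def __init__(self, on_exit):
        self._on_exit = on_exit
        self._depth = 0
        self._metrics = {}

    def __enter__(self):
        if self._depth == 0:
            self._metrics = {}  # fresh metrics for each outermost entry
        self._depth += 1
        return self

    def __exit__(self, exc_type, exc, tb):
        self._depth -= 1
        if self._depth == 0:
            self._on_exit(dict(self._metrics))
        return False

    def increment(self, field, value):
        if self._depth == 0:
            raise RuntimeError("metrics context has not been entered")
        self._metrics[field] = self._metrics.get(field, 0) + value


logged = []
ctx = ToyMetricsContext(logged.append)


@contextmanager
def toy_timed(field):
    # Adds the elapsed time of the wrapped region to the named field.
    start = time.perf_counter_ns()
    try:
        yield
    finally:
        ctx.increment(field, time.perf_counter_ns() - start)
```

Nesting `with ctx:` twice and wrapping work in `with toy_timed("toy_time_ns"):` produces exactly one entry in `logged`, emitted when the outer `with` block exits.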
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139849
Approved by: https://github.com/ezyang
Pylance infers the type of the first argument (`enabled`) to `_record_memory_history` as `str` even though the function accepts `Literal[None, "state", "all"]`.
This raises an issue when passing `None`, even though it is a legitimate argument.
This PR addresses the issue by adding the type annotation in the doc string.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140545
Approved by: https://github.com/Skylion007
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Differential Revision: D63206258
This diff introduces a mechanism to generate a json-compatible deserializer in cpp using nlohmann json (already being used by AOTI).
Why do we need this? Because there will be a lot of cases where people don't want to use Python to load the graph (e.g. cpp runtime), and instead they can use this header to deserialize the JSON graph.
Every time we call update_schema.py to update the schema, the header will be auto generated and included into the source files.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136398
Approved by: https://github.com/angelayi
On the executor side, tensor creation fails when meta.data_ptr is found not to be in the allocated memory, but there is no need to allocate memory when creating an empty tensor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140496
Approved by: https://github.com/ezyang
**About this PR**
This PR adds the following ops for `linear_dynamic_fp16` in onednn namespace. These ops are intended for PT2E quantization eager mode.
- `onednn::linear_prepack_fp16`: packs fp32 weight to an fp16 MkldnnCPU tensor.
- `onednn::linear_dynamic_fp16`: takes an fp32 CPU tensor and an fp16 MkldnnCPU tensor and computes linear in fp32.
- `onednn::linear_relu_dynamic_fp16`: similar to the former, but applies relu to the output.
**Test plan**
`python test/test_quantization.py -k test_linear_dynamic_fp16_onednn`
**Implementation**
These ops call oneDNN lib under the hood. It's worth noting that oneDNN does not support f32 * f16 -> f32 computation, so we have to convert fp16 weight to fp32 before computation. And weight is still in plain format after packing.
**Correctness and performance**
Correctness is guaranteed by UT.
Performance of the new ops may be better than the FBGEMM implementation when weight shape is small but worse when weight shape is large. It's because weight dtype conversion and computation are not fused.
For example, I ran benchmarks on an Intel(R) Xeon(R) Platinum 8490H machine with different cores and shapes. When using 1 core per instance, the new implementation generally is faster for weight shape < 1024 * 1024. When using more cores, the threshold will increase.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140376
Approved by: https://github.com/jerryzh168, https://github.com/jgong5
Related to #107302
We saw `test_float_to_int_conversion_nonfinite` fail when we upgraded to NumPy 2.
It is caused by the undefined behavior of `numpy` casting `inf`, `-inf` and `nan` from `np.float32` to other dtypes.
The test is using NumPy as reference for the ground truth. (see line 1013-1015)
However, these behaviors are undefined in NumPy.
If you do `np.array([float("inf")]).astype(np.uint8, casting="safe")`, it results in an error `TypeError: Cannot cast array data from dtype('float64') to dtype('uint8') according to the rule 'safe'`.
The undefined behaviors are always subject to change.
This PR addresses this issue by passing concrete values as the ground truth references.
In the future, even if NumPy changes its behavior, the test will remain stable.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138131
Approved by: https://github.com/drisspg
…al studio build tool is only needed for Windows
I created no issue since the suggested change is actually very small. This is my very first PR so partly I am creating it just to dip my toes in the water. In fact I would understand if the change does not get accepted since it's a simple modification to part of the wording in the README. The wording as it currently stands is probably clear enough for most people, but I still missed the fact that the Visual Studio build tool only needs to be installed on Windows (even though that is stated there). I thought adding some parentheses might make this even clearer, especially since elsewhere in the README the formatting makes it more explicit that some steps must only be run on Windows/Linux/MacOS.
As I said, it's a trivial change so I'd understand if it's not accepted, and I am looking forward to making more meaningful contributions as time goes on.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140442
Approved by: https://github.com/soulitzer
Summary:
Removes print statements and implements logging via the logging library.
Hopefully this will allow more control on the level of logging when running models.
Test Plan:
```
AOT_PARTITIONER_DEBUG=1 buck2 run @mode/opt //aps_models/ads/icvr:icvr_launcher -- mode=local_fb_fm_v4 launcher.num_workers=2
```
Resulting output paste: P1674535630
* Full logs paste: P1674535621
```
pastry P1674535621 | grep "functorch/partitioners.py" | pastry
```
Logging results: P1674549514
Differential Revision: D61678215
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139782
Approved by: https://github.com/paryxyt, https://github.com/jansel
This PR adds a native implementation of unfold_backward as a Metal shader, mostly a copy-n-paste of the algorithms used in the CUDA and CPU implementations. That is, given `out = in.unfold(dim, size, step)`, the following holds:
* `out.shape[dim] == (in.shape[dim] - size) / step + 1`
* `out.shape[-1] == size`
* `out.ndim == in.ndim + 1`
`unfold_backward` Metal kernel receives `grad_in` and returns `grad_out` such that:
* `grad_in.shape == out.shape`
* `grad_out.shape == in.shape`
For each index in `grad_out` find the elements contributing to it and sum them up. Such algorithm requires no synchronization between threads.
That is, `grad_out[...,out_dim_idx,...]` accumulates all values `grad_in[...,in_dim_idx,...,in_last_idx]`, where `in_dim_idx` is in the range [`(out_dim_idx - size) / step`, `out_dim_idx / step`] clamped to (0, `in_dim_size`) and `in_last_idx` equals `out_dim_idx - in_dim_idx * step`. The accumulation step is skipped if `in_last_idx` is outside of the [0, size) range.
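A pure-Python, one-dimensional sketch of this gather-style accumulation is shown below. It is not the Metal kernel; the function names are hypothetical, the valid range for the last index is treated as half-open (`0 <= k < size`), and the naming follows the text above, where `grad_in` has the shape of the unfolded output and `grad_out` has the shape of the original input.

```python
def unfold_1d(values, size, step):
    # Forward op for reference: one window of length `size` every `step` elements.
    n_windows = (len(values) - size) // step + 1
    return [values[w * step : w * step + size] for w in range(n_windows)]


def unfold_backward_1d(grad_in, in_len, size, step):
    # grad_in[w][k] is the gradient for element k of window w.
    n_windows = len(grad_in)
    grad_out = [0.0] * in_len
    for i in range(in_len):
        # Candidate windows that might cover position i, clamped to valid indices.
        lo = max(0, (i - size) // step)
        hi = min(n_windows - 1, i // step)
        for w in range(lo, hi + 1):
            k = i - w * step
            if not 0 <= k < size:
                continue  # position i falls outside window w
            grad_out[i] += grad_in[w][k]
    return grad_out
```

Each output position is computed independently by gathering its contributions, which is why a per-thread version of this loop needs no synchronization.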
This operator has been requested 16 times on https://github.com/pytorch/pytorch/issues/77764
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135411
Approved by: https://github.com/manuelcandales
Co-authored-by: Manuel Candales <42380156+manuelcandales@users.noreply.github.com>
Summary: Output nodes may be eliminated to the input nodes if only some of the output nodes are specified. Add an option to check results for all output nodes in the partitioned graph.
Test Plan: see D65367305
Reviewed By: qcyuan
Differential Revision: D65367305
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139774
Approved by: https://github.com/jfix71
Summary:
It seems like this issue is due to leftover CUPTI events from warmup persisting in the queue during profiling. These events start before our actual time window and therefore have a timestamp lower than our base time. This makes the delta negative, which results in unsigned overflow. This then creates a large number which later gets sign added which creates the signed overflow.
Solution: If a raw timestamp is less than the base timestamp, just mark the process timestamp as -1 so we can mark these events as "to ignore". In Kineto, add a special case to ignore timestamps that are negative.
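The arithmetic can be sketched in Python by emulating 64-bit unsigned wraparound. This is an illustration only; the function names are hypothetical and the real fix lives in the profiler and Kineto.

```python
U64 = 2**64


def delta_unsigned(raw_ts, base_ts):
    # Emulates unsigned 64-bit subtraction: a negative delta wraps to a huge value.
    return (raw_ts - base_ts) % U64


def to_process_timestamp(raw_ts, base_ts):
    # Mark events that started before the base time with -1 so they can be ignored.
    if raw_ts < base_ts:
        return -1
    return raw_ts - base_ts


def keep_valid(timestamps):
    # Consumer side: drop timestamps that were marked as negative.
    return [ts for ts in timestamps if ts >= 0]
```

For example, with a base of 10, a leftover event at raw time 5 would wrap to `2**64 - 5` under unsigned subtraction, whereas `to_process_timestamp(5, 10)` yields `-1` and the event is filtered out by `keep_valid`.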
Test Plan: Test with ASAN
Differential Revision: D65835650
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140441
Approved by: https://github.com/davidberard98
Fixes https://github.com/pytorch/pytorch/issues/138715
It looks like we were previously ignoring guards on FSDP module parameters. In the issue linked above, this was causing inductor size/stride asserts to fire. The root cause is that for some code like this:
```
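import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes torch.distributed has already been initialized.
m = FSDP(
    torch.nn.Sequential(
        torch.compile(torch.nn.Linear(1024, 1024)),
        torch.compile(torch.nn.Linear(1024, 4096)),
    )
)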
```
We need to generate two different graphs for the two linear layers, and it looks like without a `TENSOR_MATCH` guard on the linear parameters, dynamo would think that it could re-use the same graph across both layers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138819
Approved by: https://github.com/anijain2305
Summary:
LLVM-15 has a warning `-Wunused-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.
This diff either (a) removes an unused variable and, possibly, its associated code or (b) qualifies the variable with `[[maybe_unused]]`.
#buildsonlynotests - Builds are sufficient
- If you approve of this diff, please use the "Accept & Ship" button :-)
Test Plan: Sandcastle
Reviewed By: meyering
Differential Revision: D65833225
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140569
Approved by: https://github.com/Skylion007
Currently, when ptxas errors occur in one of the autotuning configs, we error out. This doesn't match the newly introduced behavior of native Triton ([here](915c149978/python/triton/runtime/autotuner.py (L164))). In this PR, we match Inductor's autotuning behavior to native Triton's by ignoring the ptxas errors and the configs that trigger them.
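The skip-on-error pattern can be sketched generically. The exception class, function names, and config strings below are all hypothetical stand-ins, not Triton or Inductor APIs:

```python
class ToyPtxasError(Exception):
    """Hypothetical stand-in for a compile-time error raised by one config."""


def benchmark_configs(configs, run_config):
    # Time every config, skipping the ones that fail to compile.
    timings = {}
    for config in configs:
        try:
            timings[config] = run_config(config)
        except ToyPtxasError:
            continue  # ignore this config instead of failing the whole autotune
    return timings


def pick_best(timings):
    # Choose the config with the smallest measured time.
    return min(timings, key=timings.get)


def fake_run(config):
    # Toy benchmark: one config fails to compile, the rest return a fake timing.
    if config == "too_many_registers":
        raise ToyPtxasError(config)
    return {"small_block": 2.0, "large_block": 1.5}[config]
```

Calling `benchmark_configs(["small_block", "too_many_registers", "large_block"], fake_run)` returns timings for the two working configs only, and `pick_best` then selects among them.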
This unblocks PT2 compilation of an internal model.
Differential Revision: D65861236
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140495
Approved by: https://github.com/chenyang78
Enable concat linear for CPU mkldnn path.
Previously, we had a concat linear in the freezing passes, but it did not work on CPU.
This is because the `concat_linear` pattern ran after `mkldnn_weight_prepack`, and `concat_linear` only handles `addmm/mm` etc.
```
addmm -> mkldnn linear
addmm -> mkldnn linear -> cannot concat
# only worked when disable mkldnn
addmm ->
addmm -> concat linear
```
Now we have changed the pass numbers of the `mkldnn linear` related passes to be larger than the `concat_linear` pass numbers.
```
addmm -> concat linear -> mkldnn linear
addmm ->
```
So it now works with mkldnn linear.
Also, since concat linear does not always bring a benefit, we add a flag, `config.cpp.enable_concat_linear`, and set its default value to False. Users can enable it as needed.
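For readers unfamiliar with the transformation, the equivalence that concat linear relies on can be checked in plain Python. This is a toy sketch with hypothetical helper names, not the Inductor pass: several linear layers that share one input can be computed as a single linear layer over the concatenated weights and biases, with the output split afterwards.

```python
def linear(x, weight, bias):
    # weight has shape (out_features, in_features); x has in_features entries.
    return [sum(xk * wk for xk, wk in zip(x, row)) + b for row, b in zip(weight, bias)]


def concat_linear(x, layers):
    # layers is a list of (weight, bias) pairs that all consume the same input x.
    weight = [row for w, _ in layers for row in w]
    bias = [b for _, bs in layers for b in bs]
    out = linear(x, weight, bias)  # one larger matmul instead of several small ones
    results, start = [], 0
    for w, _ in layers:
        results.append(out[start : start + len(w)])
        start += len(w)
    return results
```

The single concatenated call returns the same values as calling `linear` once per layer, which is why the rewrite is safe; whether it is faster depends on the shapes, consistent with the flag being off by default.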
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139048
Approved by: https://github.com/jgong5, https://github.com/jansel
Summary:
1. We don't want to exit with exceptions when there are so many mismatches. We should just break and return.
2. Polish the message of dtype mismatch. This is because dtype of input/output is actually a list not a string. So we don't want to show a list of ['double'] in the output message.
Test Plan:
Testing on the case when we see too many collective dtype mismatch
{F1958467224}
Differential Revision: D65841830
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140451
Approved by: https://github.com/c-p-i-o
aten._add_relu doesn't have a meta function registered, so in the dynamic shape case it throws an error in the dynamo logs:
Error:
`V1107 11:25:32.344000 140481543555072 torch/_dynamo/symbolic_convert.py:534] [0/1] [__graph_breaks] NotImplementedError: aten::_add_relu.Tensor: attempted to run this operator with Meta tensors, but there was no fake impl or Meta kernel registered. You may have run into this message while using an operator with PT2 compilation APIs (torch.compile/torch.export); in order to use this operator with those APIs you'll need to add a fake impl.`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140009
Approved by: https://github.com/ezyang
This PR is a supplement to https://github.com/pytorch/pytorch/pull/133980. The previous PR provided the basic functionality of the XPU device guard, but we found that it fails to address structured operators.
With the current PR, the code snippet in RegisterXPU.cpp is as follows, where we can see the device guard is successfully generated.
```c++
struct structured_exp_out_functional final : public at::native::structured_exp_out {
    void set_output_strided(
        int64_t output_idx, IntArrayRef sizes, IntArrayRef strides,
        TensorOptions options, DimnameList names
    ) override {
        auto current_device = guard_.current_device();
        if (C10_UNLIKELY(current_device.has_value())) {
            TORCH_INTERNAL_ASSERT(*current_device == options.device(),
                "structured kernels don't support multi-device outputs");
        } else {
            guard_.reset_device(options.device());
        }
        outputs_[output_idx] = create_out(sizes, strides, options);
        if (!names.empty()) {
            namedinference::propagate_names(outputs_[output_idx], names);
        }
        // super must happen after, so that downstream can use maybe_get_output
        // to retrieve the output
        at::native::structured_exp_out::set_output_raw_strided(output_idx, sizes, strides, options, names);
    }
    void set_output_raw_strided(
        int64_t output_idx, IntArrayRef sizes, IntArrayRef strides,
        TensorOptions options, DimnameList names
    ) override {
        auto current_device = guard_.current_device();
        if (C10_UNLIKELY(current_device.has_value())) {
            TORCH_INTERNAL_ASSERT(*current_device == options.device(),
                "structured kernels don't support multi-device outputs");
        } else {
            guard_.reset_device(options.device());
        }
        outputs_[output_idx] = create_out(sizes, strides, options);
        if (!names.empty()) {
            namedinference::propagate_names(outputs_[output_idx], names);
        }
        // super must happen after, so that downstream can use maybe_get_output
        // to retrieve the output
        at::native::structured_exp_out::set_output_raw_strided(output_idx, sizes, strides, options, names);
    }
    const Tensor& maybe_get_output(int64_t output_idx) override {
        return outputs_[output_idx];
    }
    std::array<Tensor, 1> outputs_;
    c10::OptionalDeviceGuard guard_;
};
```
However, without the current change, the generated code is
```c++
struct structured_exp_out_functional final : public at::native::structured_exp_out {
    void set_output_strided(
        int64_t output_idx, IntArrayRef sizes, IntArrayRef strides,
        TensorOptions options, DimnameList names
    ) override {
        outputs_[output_idx] = create_out(sizes, strides, options);
        if (!names.empty()) {
            namedinference::propagate_names(outputs_[output_idx], names);
        }
        // super must happen after, so that downstream can use maybe_get_output
        // to retrieve the output
        at::native::structured_exp_out::set_output_raw_strided(output_idx, sizes, strides, options, names);
    }
    void set_output_raw_strided(
        int64_t output_idx, IntArrayRef sizes, IntArrayRef strides,
        TensorOptions options, DimnameList names
    ) override {
        outputs_[output_idx] = create_out(sizes, strides, options);
        if (!names.empty()) {
            namedinference::propagate_names(outputs_[output_idx], names);
        }
        // super must happen after, so that downstream can use maybe_get_output
        // to retrieve the output
        at::native::structured_exp_out::set_output_raw_strided(output_idx, sizes, strides, options, names);
    }
    const Tensor& maybe_get_output(int64_t output_idx) override {
        return outputs_[output_idx];
    }
    std::array<Tensor, 1> outputs_;
};
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138802
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/ezyang
There has been a series of attempts to provide support for resizing in
torch operators like `torch.sigmoid(x, out=y)`, i.e., `y` would have a
different shape before and after this expression. Prior to this patch,
we have some checks to graph break if the shape changed.
This patch
1. extends the existing check and graph break for any shape change, not
just for `TensorVariable` with source field.
2. removes an old code path which was introduced to address the shape
change, but became obsolete in that regard because we added extra
checks to graph break upon shape change. Moreover, this old code path
is unsound: it tries to replace references to the old
`TensorVariable` with the new one returned by `wrap_fx_proxy`, but it only
does the replacement in `symbolic_locals`, which breaks when cells
are involved. In general the old `TensorVariable` could be _anywhere_;
think of the `replace_all` we had for immutable VTs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140202
Approved by: https://github.com/jansel
ghstack dependencies: #140035, #140036, #140149, #140150, #140151, #140201
This patch fixes 2 things which are exposed if we have `NewCellVariable`
rather than `ClosureVariable` to model python cells:
1. `codegen_save_tempvars` must run first, to establish `source` for
objects, otherwise they can't reconstruct.
2. `prune_dead_object_new` must account for `OutputGraph.backward_state`
as well, since it also contains variables that must live.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140201
Approved by: https://github.com/jansel
ghstack dependencies: #140035, #140036, #140149, #140150, #140151
The `cell_or_freevar` was added in #106403 to help us ensure
Dynamo-export only allows graph input that depends on the frame input
(rather than a captured cell, for instance).
However, when taken literally, the `cell_or_freevar` condition is
actually not accurate, because for frame inputs that are also cells
(i.e., captured by some inner function), we actually set the
`cell_or_freevar` flag to false. This makes sense, because otherwise the
existing implementation would prevent Dynamo-export to add any of these
inputs to the graph.
To help with reasoning, this patch refines the `cell_or_freevar` flag to
what we really want to check -- `is_input`, and updates the relevant use
sites.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140151
Approved by: https://github.com/jansel
ghstack dependencies: #140035, #140036, #140149, #140150
In `UserFunctionVariable.bind_args`, there's a rare case when the
underlying function satisfies all conditions below
1. The function captures a pre-existing cell
2. The cell isn't captured by root frame
3. `UserFunctionVariable.source` is `None`
In such cases, Dynamo would model the cell as its content (just like
what we do for cells in the root frame). However, this could break in
two cases:
- We could have multiple instances of `UserFunctionVariable`, where some
have source and others don't. This means sometimes we'll model the
cell as a `NewCellVariable`, and sometimes as its content. This
causes issues because writes to the `NewCellVariable` would be
buffered in `SideEffects` and never get picked up by the other
modeling.
- Only when `UserFunctionVariable` has a source, do we check whether we
already had a `NewCellVariable` for the captured cell. This again causes
Dynamo to potentially have multiple representations for the same cell
object, resulting in a similar "buffered writes not reflected" issue
as above.
This patch fixes the above 2 issues by
1. modeling captured cells of sourceless `UserFunctionVariable` as
immutable `NewCellVariable`, and adds a few lines in `SideEffects` to
account for its immutability.
2. always checking whether we already had a `NewCellVariable` for the
captured cell, before constructing a new one.
Tests are added for each aforementioned case.
I also left a TODO to investigate why exactly we would lose source
information for `UserFunctionVariable`. Some cases are easily fixable,
but others not so much.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140150
Approved by: https://github.com/jansel
ghstack dependencies: #140035, #140036, #140149
We added an unboxing optimization to avoid writes to cells that existed
before Dynamo tracing (such writes interfere with HOPs). However, the
avoided write shouldn't be there in the first place, since we were
basically creating an empty `NewCellVariable`, and then writing the
pre-existing content into the variable.
This patch
1. adds logic to bypass the initial write for pre-existing cells
without undermining correctness.
2. removes the unboxing optimization and the restart code path.
Fixes #137456, #138491; also see those issues for more historical
context.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140149
Approved by: https://github.com/ezyang, https://github.com/jansel
ghstack dependencies: #140035, #140036
The `export_freevars` method was introduced very early on, for
propagating writes to unboxed cells from child to parent frame, see
https://github.com/pytorch/torchdynamo/commit/d0c10341.
However, it's no longer needed after we started to modify root tracer's
`symbolic_locals` directly for the unboxed cells, see
https://github.com/pytorch/torchdynamo/commit/663e4d92.
As a result, we no longer need `export_freevars`. In fact, it can cause
a very subtle bug when name collision happens across the parent and
child frames during inlining, because the parent frame isn't necessarily
the frame that defined the cell captured by child frame.
In summary, this patch removes the `export_freevars` bits, and adds a
regression test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140036
Approved by: https://github.com/williamwen42, https://github.com/jansel
ghstack dependencies: #140035
This patch establishes the invariant that `ClosureVariable` and
`NewCellVariable` are always in `closure_cells`, never in
`symbolic_locals`, and therefore removes some duplicated code paths.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140035
Approved by: https://github.com/jansel
Summary:
`with_comms()` is mostly used as a decorator with an optional input argument `eager_init`. The problem with a decorator that takes an argument is that it must always be invoked, i.e., you have to write `with_comms()` rather than `with_comms`, which is what the majority of the existing usages do.
This diff tries to provide a solution such that we could use `with_comms`, `with_comms()`, `with_comms(eager_init=False)`, and `with_comms(eager_init=True)`.
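A minimal sketch of one common way to support all four spellings, not necessarily the code in this diff: make the function argument optional and keyword-only for everything else. The body of `wrapper` is a placeholder (the real `with_comms` sets up and tears down process groups), and `last_eager_init` is a hypothetical attribute used only to make the flag observable.

```python
import functools


def with_comms(func=None, *, eager_init=False):
    """Usable as @with_comms, @with_comms(), or @with_comms(eager_init=...)."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # Placeholder for the real setup/teardown; only record the flag here.
            wrapper.last_eager_init = eager_init
            return fn(*args, **kwargs)

        return wrapper

    if func is not None:
        # Bare usage: @with_comms passes the decorated function directly.
        return decorator(func)
    # Called usage: @with_comms(...) must return the actual decorator.
    return decorator


@with_comms
def test_a():
    return "a"


@with_comms(eager_init=True)
def test_b():
    return "b"
```

Making `eager_init` keyword-only keeps the bare form unambiguous: the only positional argument the decorator can ever receive is the function being decorated.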
Test Plan: Contbuild & OSS CI
Differential Revision: D65385700
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139637
Approved by: https://github.com/wz337
Previously the split decomp would return the input when there were no splits. This errors in torch.compile (or FakeTensorMode) with:
> RuntimeError: View operation returned a tensor that is the same as the input base tensor. This is no longer allowed; you must explicitly create a new tensor (e.g., using .detach()). As a user, you could have made a mistake implementing __torch_dispatch__ or a Python operator decomposition or meta registration; if that's not the case, please report a bug to PyTorch or the backend you are using.
Fix for https://github.com/pytorch/pytorch/issues/133394
Differential Revision: [D65635070](https://our.internmc.facebook.com/intern/diff/D65635070)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140065
Approved by: https://github.com/bdhirsh
Summary:
The recent attempts at using the bandwidth profiler did not behave as expected. I observed a few issues and tried to fix them in this diff:
1. The return value of the DebugAutotuner class.
2. Profiling results show a very large overhead.
DebugAutotuner.run() returns a benchmark time of around 45ms, while CachingAutotuner.run() returns around 0.45ms.
`_find_names` and `re.match` take 45ms: P1669186358
After commenting out `_find_names` and `re.match`, the benchmark time becomes consistent with non-profiling mode: P1669185589
3. Introduce a variable `bandwidth_info` to control the path in DebugAutotuner.run(). During benchmarking for configuration selection, `bandwidth_info` should be turned off.
After applying this diff, the profiling issues mentioned above are fixed: P1669273172
Test Plan:
```
TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_PROFILE=1 TORCHINDUCTOR_PROFILE_OUTPUT=~/tmp/profile.txt TORCH_LOGS='+inductor,+schedule,output_code' TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 TORCHINDUCTOR_BENCHMARK_KERNEL=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 CUDA_VISIBLE_DEVICES=5 buck run mode/{opt,inplace} scripts/wwei6/triton_examples:test_mat 2>&1 | tee profiling-5.log
```
If we want to disable the Aten backend, just add TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS="TRITON"
Differential Revision: D64883079
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139607
Approved by: https://github.com/chenyang78
Summary: Add a log warning users about how disabling only CUDA events can cause incorrect correlation IDs
Test Plan: Log was printed in the correct scenario
Differential Revision: D65762576
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140285
Approved by: https://github.com/sanrise
Remove most references to rockset:
* replace comments and docs with a generic "backend database"
* Delete `upload_to_rockset`, so we no longer need to install the package.
* Do not upload perf stats to rockset as well (we should be completely on DynamoDB now right @huydhn?)
According to VSCode, it went from 41 -> 7 instances of "rockset" in the repo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139922
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
Before this change, if one builds PyTorch without XPU, the build process will
perpetually regenerate code because of a reference to a non-existent
file, which makes the autograd codegened files always out of date. See part of the `ninja -d explain torch_cpu` output:
```
ninja explain: output ../torch/csrc/inductor/aoti_torch/generated/c_shim_xpu.cpp doesn't exist
ninja explain: output third_party/kineto/libkineto/CMakeFiles/libkineto_defs.bzl of phony edge with no inputs doesn't exist
ninja explain: third_party/kineto/libkineto/CMakeFiles/libkineto_defs.bzl is dirty
ninja explain: /Users/malfet/git/pytorch/pytorch/torch/csrc/autograd/generated/Functions.cpp is dirty
```
This is a regression introduced by https://github.com/pytorch/pytorch/pull/139025.
After this change, incremental rebuilds with no changes cause no build actions:
```
% ninja -j1 -v -d explain -n torch_cpu
ninja explain: output third_party/kineto/libkineto/CMakeFiles/libkineto_defs.bzl of phony edge with no inputs doesn't exist
ninja explain: third_party/kineto/libkineto/CMakeFiles/libkineto_defs.bzl is dirty
ninja: no work to do.
```
Test plan: Wait for at least one XPU build to finish...
Fixes https://github.com/pytorch/pytorch/issues/140432
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140438
Approved by: https://github.com/kit1980, https://github.com/huydhn
Previously we assumed that the number of tensor elements multiplied by the type size is not greater than the allocated memory size. However in some scenarios such as `tensor.expand`, the stride can be zero, which makes the assumption not true.
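To illustrate why the assumption breaks, here is a small pure-Python sketch that compares the naive `numel * itemsize` estimate with the bytes a strided view can actually reach. The helper names are made up for this example and are not part of the patch.

```python
def numel(sizes):
    """Number of logical elements in a tensor of the given shape."""
    n = 1
    for s in sizes:
        n *= s
    return n


def storage_bytes_needed(sizes, strides, itemsize, storage_offset=0):
    """Bytes of storage a strided view can reach (0 for empty tensors)."""
    if any(s == 0 for s in sizes):
        return 0
    # Offset (in elements) of the last reachable element.
    last = storage_offset + sum(
        (size - 1) * stride for size, stride in zip(sizes, strides)
    )
    return (last + 1) * itemsize


# A one-element float32 tensor expanded to shape (4, 1024): every stride is 0.
sizes, strides, itemsize = (4, 1024), (0, 0), 4
naive = numel(sizes) * itemsize  # what the old assumption implied
actual = storage_bytes_needed(sizes, strides, itemsize)
```

Here `naive` is far larger than `actual`, because all 4096 logical elements alias the same 4 bytes.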
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140255
Approved by: https://github.com/ezyang
This PR replaces the parameter names specified in the `triangular_solve_meta`
function (specifically in its `@out_wrapper(...)` decorator) by those written in the
_native_functions.yaml_ file.
This name mismatch caused the operation to fail when using the meta device (see error
below):
```python
Traceback (most recent call last):
  File "examples/test.py", line 23, in <module>
    torch.triangular_solve(b.to("meta"), A.to("meta"), out=meta_out)
  File "torch/_decomp/__init__.py", line 100, in _fn
    return f(*args, **kwargs, out=None if is_none else out_kwargs)
  File "torch/_prims_common/wrappers.py", line 289, in _fn
    result = fn(*args, **kwargs)
TypeError: triangular_solve_meta() got an unexpected keyword argument 'X'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140186
Approved by: https://github.com/ezyang
Differential Revision: [D65307961](https://our.internmc.facebook.com/intern/diff/D65307961/)
This PR introduces the concept of a "dispatcher" module `n` that carries multiple interpreter modules `n`, `n@1`, `n@2`, etc., each corresponding to a particular call of `n` and thus might carry a different specialized graph. We only do this when we're preserving module call signatures for `n`. The carried modules have the same number and order of calls to `n` appearing in the original module / exported program. In the unflattened module, all those calls go to the "dispatcher" module which internally tracks how many calls have been made so far and invokes the corresponding interpreter module. We reset this tracking after a successful or unsuccessful run of the unflattened module.
Overall this makes swapping easier when module call signatures are preserved.
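The call-tracking idea can be sketched in plain Python as follows. This is an illustration only: `Dispatcher` and `run` are hypothetical stand-ins, and plain callables take the place of the interpreter modules `n`, `n@1`, etc.

```python
class Dispatcher:
    """Routes the i-th call within one run to the i-th carried callable."""

    def __init__(self, call_modules):
        self._call_modules = list(call_modules)  # stand-ins for n, n@1, n@2, ...
        self._num_calls = 0

    def __call__(self, *args, **kwargs):
        module = self._call_modules[self._num_calls]
        self._num_calls += 1
        return module(*args, **kwargs)

    def reset(self):
        self._num_calls = 0


def run(dispatcher, x):
    """Stand-in for one run of the unflattened module, which calls `n` twice."""
    try:
        return dispatcher(dispatcher(x))
    finally:
        # Reset whether the run succeeded or raised.
        dispatcher.reset()


n = Dispatcher([lambda x: x + 1, lambda x: x * 10])
first = run(n, 1)
second = run(n, 2)
```

Resetting in a `finally` block is what lets a later run start from the first carried module again, even after a failed run.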
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139439
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #139438
Differential Revision: [D65308061](https://our.internmc.facebook.com/intern/diff/D65308061/)
When a shared submodule is called multiple times with different aliases, e.g., `self.a` and `self.b` are both `C()` under the hood and we have calls to both `self.a(...)` and `self.b(...)`, we wrap `C()` to emit as many export tracepoints as there are aliases. This caused us to compute module call signatures that conflated information: we'd add inputs and outputs of one call to inputs and outputs of a different call. Overall preserving module call signatures in the presence of shared submodules was borked because of this bug.
The fix is to pay attention to the nn module stack, which accurately tracks individual calls, thus allowing us to ignore some export tracepoints that get the module correct but not the alias through which the call was made.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139438
Approved by: https://github.com/zhxchen17
This was introduced in https://github.com/pytorch/torchdynamo/commit/d0c10341
as limited support for pre-existing cells, since we know `__class__` wouldn't be modified
in most cases. It's no longer needed now that we have much more support for these cells.
Example:
```python
class Foo():
    def __init__(self):
        super().__init__()

print(Foo.__init__.__code__.co_freevars) # ('__class__',)
print(Foo.__init__.__closure__) # (<cell at 0x1011fb310: type object at 0x10fe185b0>,)
```
This patch also exposed and fixes a bug in
`NNModuleVariable.var_getattr`, where Dynamo wasn't propagating source
correctly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140034
Approved by: https://github.com/williamwen42, https://github.com/anijain2305, https://github.com/jansel
Summary:
UBSan hits undefined behavior in this file. This fixes it by marking these pointers as unaligned.
```
caffe2/aten/src/ATen/native/quantized/cpu/qnnpack/__ukernels_sse2__/buck-private-headers/q8gemm/4x4c2-sse2.c:325:5: runtime error: store to misaligned address 0x62900313891f for type 'uint32_t' (aka 'unsigned int'), which requires 4 byte alignment
0x62900313891f: note: pointer points here
be be be be be be be be be be be be be be be be be be be be be be be be be be be be be be be be
^
UndefinedBehaviorSanitizer: undefined-behavior buck-caffe2/aten/src/ATen/native/quantized/cpu/qnnpack/__ukernels_sse2__/buck-private-headers/q8gemm/4x4c2-sse2.c:325:5 in
```
The fix is to mark these variables as unaligned following D42179009's example
q8gemm.cc + internal integration test
Differential Revision: D65637959
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140188
Approved by: https://github.com/digantdesai
Leverage existing FindGloo CMake module to locate system's library and headers. Add system's gloo headers to include path rather than the gloo from third party when USE_SYSTEM_GLOO is specified.
Fixes #140274
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140275
Approved by: https://github.com/malfet
It was raised that the backwards layer norm on AMD was slightly off the accuracy of the equivalent NVIDIA implementation.
On AMD we call into a helper kernel `cuLoadWriteStridedInputs` which processes strided input and accumulates the partial gradients into shared memory.
In this kernel (https://github.com/pytorch/pytorch/pull/87635) we truncated `mean` and `rstd` from T_ACC type to T which causes numerical issues in the warp buffers created in this kernel. This PR will use the correct accumulator type for mean and rstd.
Note: Only AMD calls into this call stack for backwards layer norm, so this was not an issue for NV.
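As a rough illustration of the precision lost by the truncation, the sketch below round-trips a value through half and single precision using Python's `struct` module. It assumes `T` is float16 and `T_ACC` is float32; the value of `mean` is made up and not taken from a real run.

```python
import struct


def round_trip(value, fmt):
    """Store `value` in the given struct float format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


mean = 1000.3  # illustrative value only

as_half = round_trip(mean, "e")   # T = float16: what the truncation kept
as_float = round_trip(mean, "f")  # T_ACC = float32: what the fix keeps

half_error = abs(as_half - mean)
float_error = abs(as_float - mean)
```

The half-precision error is several orders of magnitude larger than the single-precision one, which is the kind of discrepancy that then propagates into the partial gradients.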
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140259
Approved by: https://github.com/jianyuh
This PR updates OpInfo-based tests for NJTs:
* Adds extensive coverage across non-contiguous NJTs (both non-contiguous transposed and non-contiguous with holes)
* The `_sample_njts()` helper that `sample_input_func`s utilize now produces non-contig NJTs as well
* Utilizes a `SampleInput`-based xfail system for granular classification of bugs. For example, it's possible to indicate that a class of ops is expected to fail only on non-contig with holes NJT inputs.
* I decided on adding `SampleInput`s and utilizing this system over using test parametrization for two reasons:
* Test perf - adding `SampleInput`s is faster than generating entire new tests
* Avoiding the possibility of `sample_input_func`s not respecting the non-contig test parameter - this would result in silently incorrect passing of these tests. Keeping the responsibility for `SampleInput` generation firmly within each `OpInfo`'s `sample_input_func` means weirdness like this isn't possible
* Improves `SampleInput` naming for a bunch of `sample_input_func`s. This makes it easier to xfail them as needed. For example, binary / unary / other ops now use the new `_describe_njt()` helper to get a string repr that uniquely defines the type of NJT being passed to the op
* Adds appropriate `XFailRule`s to get tests passing for forward / backward / forward compile / backward compile. In general, each xfail corresponds to some bug that needs to be fixed
```python
# Represents a rule indicating how to xfail a particular test. It allows granularity
# at the device, dtype, op, and individual sample levels. This flexibility allows entire
# bugs to be represented by a single rule, even if this corresponds with multiple conceptual
# test cases across multiple ops.
@dataclass
class XFailRule:
    # expected error type
    error_type: TypeVar = Exception
    # expected error message
    error_msg: str = ".*"
    # function to indicate whether the rule applies; return True if so
    match_fn: Callable[[torch.device, torch.dtype, OpInfo, SampleInput], bool] = None
    # optional name for identifying the rule
    name: str = ""

    def match(self, device, dtype, op, sample) -> bool:
        return self.match_fn(device, dtype, op, sample)
```
Example:
```python
# Bug when broadcasting a binary op with non-contiguous with holes NJT + dense
# tensor with 1 in ragged dim.
XFailRule(
    error_type=RuntimeError,
    error_msg="cannot call binary pointwise function .* with inputs of shapes",
    match_fn=lambda device, dtype, op, sample: (
        isinstance(op, BinaryUfuncInfo)
        and "noncontig_holes" in sample.name
        and "broadcasting 1 over ragged" in sample.name
    ),
    name="binary_noncontig_holes_broadcasting_1_over_ragged",
),
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138370
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
ghstack dependencies: #140160
This PR updates the binding for `stream_write_value32` to be consistent with `memset32`, which IMO makes more sense for this type of utility:
- Changed the API to take a uint32 tensor as argument, instead of a device pointer
- Changed the Python binding to be a static method of `_SymmetricMemory`, instead of an object method
- Use the dispatcher for device dispatching, as opposed to `SymmetricMemory` backends
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139934
Approved by: https://github.com/weifengpy
ghstack dependencies: #139227
Buck1 is no longer supported in favor of buck2. This CI tests the old buck1 flow, however it is difficult to maintain especially since buck1 doesn't support aarch64 mac.
I am suggesting that this CI be deprecated until a decision on buck2 is made, and buck2 support is added. As of now, there seems to be no push towards adding buck2 support.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140067
Approved by: https://github.com/huydhn
It was wrong to add it to MPSDevice in the first place, as in the end it's just a regular shader, like all others.
I.e. this PR:
- Moves contents of `at::mps::indexing_metal_shaders` into `kernels/Indexing.metal`
- Deletes `MPSDevice::getMetalIndexingLibrary()` and `MPSDevice::metalIndexingPSO` methods
- Moves `at::native::mps::generateKernelDataOffsets` implementation from `OperationUtils.mm` to `Indexing.mm`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140271
Approved by: https://github.com/Skylion007
Publish current state of s390x builder image to allow reproducing worker setup.
Also, if this image gets published to docker repository later, it'd be possible to download published image instead of building it into worker image in https://github.com/pytorch/pytorch/blob/main/.github/scripts/s390x-ci/self-hosted-builder/actions-runner.Dockerfile#L66, which should allow improving restart time at the cost of additional runtime overhead.
Compared to first attempt to merge:
- default docker repository settings are added to all runners. Changes are mirrored in this PR.
- job is moved into separate workflow file.
- it's no longer attempted to update limits on s390x. Limits should be properly set up there on the host. And it's not possible to update them from worker since it runs in container. Also, worker container currently doesn't have sudo installed or configured or any systemd running.
- github token is now passed once via named pipe instead of environment variable. This should increase security of tokens.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132983
Approved by: https://github.com/huydhn, https://github.com/malfet
This PR adds "PrimHOPBase", which is intended to be a base class that
one can extend to create new HOPs that match some criteria:
- they take one subgraph as input, and their semantics are running the
subgraph on some operands
- the HOP stays alive until Inductor
The motivation is that we are seeing a lot more HOPs (invoke_subgraph,
invoke_quant) that have this property and there can be a lot of shared
code between them.
Future:
- Migrate invoke_subgraph to use this
- There are some TODOs in the code
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139898
Approved by: https://github.com/anijain2305, https://github.com/ydwu4
In old triton versions, you take the hash of the triton kernel and use it in the filepath for the cached kernel. In Triton 3.2 (after https://github.com/triton-lang/triton/pull/4553), the filepath will use the base-64-encoded representation of the hash in the path.
This PR checks whether the `_base64` function exists in triton, and if so, uses the base-64-encoded representation in the path.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140190
Approved by: https://github.com/ezyang
Here's the overview:
There's a new contextmanager singleton called MetricsContext. Entering the MetricsContext is how we demarcate the boundary on which we'll create a single CompilationMetrics object, and therefore, a single dynamo_compile log entry. While we're inside the MetricsContext, we can update/set many different metrics. Most importantly: `dynamo_timed` can also update the in-progress MetricsContext. In the proposal here, we tell `dynamo_timed` that we want it to do so by providing the name of the MetricsContext field to increment. There can be many `dynamo_timed` calls in different parts of the code updating different fields. Then when the MetricsContext exits, that's when the logging of everything gathered finally happens. One potential footgun is trying to use `dynamo_timed` when we haven't entered the MetricsContext, but we assert on that problem. Another problem is that we re-enter the context recursively, but we watch for that and do the logging only when the outermost exits.
Some specifics:
* Introduce MetricsContext - a context manager that on exit, records the CompilationMetrics (which also logs to dynamo_compile).
* Completely remove the concept of frame_phase_timing. Instead, update the MetricsContext during compilation, either directly or via dynamo_timed.
* Remove some globals we previously used to accumulate counters to later populate a CompilationMetrics. We use CompilationMetrics set/update/increment APIs instead.
* `record_compilation_metrics` is now called on exit from MetricsContext.
* Populate legacy CompilationMetrics fields right before logging, inside `record_compilation_metrics`.
* Remove the one-off `add_remote_cache_time_saved` helper; capture that timing directly into the MetricsContext.
And specifically, several changes to dynamo_timed:
* "Modernize" the parameters and update all callsites accordingly.
* Move the backwards logging of the CompilationMetrics to the backwards compile location.
* Add a parameter for which CompilationMetrics field to update
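The re-entrancy and "log on outermost exit" behavior described above can be sketched as follows. This is a simplified stand-in, not the actual implementation: the `on_exit` callback, the `timed` helper (a stand-in for `dynamo_timed`), and the field names are all assumptions made for the example.

```python
import time


class MetricsContext:
    """Re-entrant context that reports metrics when the outermost level exits."""

    def __init__(self, on_exit):
        self._on_exit = on_exit
        self._level = 0
        self._metrics = {}

    def __enter__(self):
        if self._level == 0:
            self._metrics = {}  # fresh metrics for each outermost entry
        self._level += 1
        return self

    def __exit__(self, exc_type, exc, tb):
        self._level -= 1
        if self._level == 0:
            self._on_exit(dict(self._metrics))

    def increment(self, name, value):
        assert self._level > 0, "metrics updated outside of a MetricsContext"
        self._metrics[name] = self._metrics.get(name, 0) + value


class timed:
    """Adds the elapsed nanoseconds of a block to one metrics field."""

    def __init__(self, ctx, field):
        self._ctx, self._field = ctx, field

    def __enter__(self):
        self._start = time.perf_counter_ns()
        return self

    def __exit__(self, exc_type, exc, tb):
        self._ctx.increment(self._field, time.perf_counter_ns() - self._start)


logged = []
ctx = MetricsContext(on_exit=logged.append)
with ctx:
    with timed(ctx, "phase_time_ns"):  # hypothetical field name
        with ctx:  # recursive entry: nothing is logged when this exits
            ctx.increment("graph_count", 1)
```

After the outer `with` block exits, `logged` holds exactly one dictionary containing both fields; calling `ctx.increment` outside any context trips the assertion.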
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139849
Approved by: https://github.com/ezyang
ghstack dependencies: #140094
Summary:
I was looking into why the non-standard bool value will fail for msort - it makes sense for argsort and sort to fail, because we're randomly generating uint8 so the order will be different (and thus the indices will be different). But msort should work.
After some digging, it's interesting that even though scalar_t is bool, when the actual value is a uint8_t, the comparison will treat them as signed. I tried lhs=255 and rhs=0: lhs < rhs is equivalent to -1 < 0 which is true (but it's supposed to be False)
Therefore we add an explicit type cast.
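The arithmetic can be mimicked in Python (the actual fix is a cast in the C++ comparator; this sketch only reproduces the numbers from the example above, and `as_signed_byte` is a helper invented for the illustration):

```python
import struct


def as_signed_byte(value):
    """Reinterpret an unsigned byte (0..255) as a signed 8-bit integer."""
    return struct.unpack("b", struct.pack("B", value))[0]


lhs, rhs = 255, 0  # lhs is a non-standard "true" byte, rhs is false

# Comparing the raw bytes as signed values: -1 < 0, so "true < false" holds.
raw_less = as_signed_byte(lhs) < as_signed_byte(rhs)

# Normalizing to bool first gives the intended ordering: True < False is False.
cast_less = bool(lhs) < bool(rhs)
```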
Test Plan: Remove the test skip
Differential Revision: D65472170
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139870
Approved by: https://github.com/Skylion007, https://github.com/davidberard98
This PR updates the binding for `stream_write_value32` to be consistent with `memset32`, which IMO makes more sense for this type of utility:
- Changed the API to take a uint32 tensor as argument, instead of a device pointer
- Changed the Python binding to be a static method of `_SymmetricMemory`, instead of an object method
- Use the dispatcher for device dispatching, as opposed to `SymmetricMemory` backends
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139934
Approved by: https://github.com/weifengpy
ghstack dependencies: #139227
This changes the conda-builder workflow to almalinux-builder and switches Docker file to almalinux.
Please note: Published conda-builder images will still be available, hence workflows that use these images will still work.
We will be switching workflows that use conda-builder images to almalinux-builder
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140157
Approved by: https://github.com/malfet
# Motivation
This PR aims to maintain backward compatibility when building PyTorch XPU with the old and new compilers.
# Additional Context
The details are described here. The new compiler (2025.0.0) has some breaking changes compared with the old compiler (2024.1), for example:
1. On Windows, sycl library is named `sycl7.lib` in the old compiler but is named `sycl.lib` in the new compiler.
2. On Linux, in order to support ABI=0, we have to link `libsycl-preview.so` in the old compiler but we could link `libsycl.so` in the new compiler to have the same ABI compatibility.
3. We added a macro `SYCL_COMPILER_VERSION` so that our new code stays backward compatible with the old compiler. The new features (Event elapsed_time, memory summary, and device architecture property) introduced by the new compiler are now guarded by the macro `SYCL_COMPILER_VERSION`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139258
Approved by: https://github.com/EikanWang, https://github.com/atalman, https://github.com/gujinghui
[AOTI] Introduce an extensibility mechanism for the c shim codegen to make it easy to produce c shims for out-of-tree OP kernels as well. Add c shim for XPU.
### Motivation
The current c shim codegen only produces C wrappers for ops registered in `aten/src/ATen/native/native_functions.yaml`. For a given backend, some out-of-tree ops are not registered in that file but are registered externally, for example in `third_party/torch-xpu-ops/yaml/native_functions.yaml`. In this case, the existing codegen cannot extend the c shims already produced for the in-tree ops with the out-of-tree ops.
### Design
To extend the c shim for a backend with more out-of-tree ops, this PR provides a bool option `--aoti-extend` to indicate that the codegen should extend the c shim from out-of-tree.
The generated c shim is stored in the `extend` subdirectory, for example:
```
torch/include/torch/csrc/inductor/aoti_torch/generated/c_shim_xpu.h
torch/include/torch/csrc/inductor/aoti_torch/generated/c_shim_xpu.cpp
torch/include/torch/csrc/inductor/aoti_torch/generated/extend/c_shim_xpu.h
torch/include/torch/csrc/inductor/aoti_torch/generated/extend/c_shim_xpu.cpp
```
example usage:
`python -m torchgen.gen --source-path third_party/torch-xpu-ops/yaml/ --xpu --aoti-extend --update-aoti-c-shim `
`--xpu`: generate c shim for XPU
`--aoti-extend`: extend the in-tree ops (defined in `aten/src/ATen/native/native_functions.yaml`) with the out-of-tree ops (defined in `third_party/torch-xpu-ops/yaml/native_functions.yaml`)
`--update-aoti-c-shim`: always generate c_shim_xpu.h for the extend c_shim.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136742
Approved by: https://github.com/EikanWang, https://github.com/desertfire
ghstack dependencies: #139025
[Intel GPU] Support RegisterXPU.cpp codegen and compile for the in-tree XPU structured GEMM ops.
Motivation: There are two parts of aten ops for XPU: one is the in-tree ops, like the GEMM-related ops, and the other is the out-of-tree ops in torch-xpu-ops. For the in-tree part, since PyTorch uses native_functions.yaml registration and is equipped with convenient codegen capabilities, we want to take advantage of these benefits as well.
At the same time, since AOT Inductor also uses native_functions.yaml to generate c shim wrappers, we also need to enable this mechanism for XPU.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139025
Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/desertfire
Summary: Add time log to cudagraph, including `create deferred_cudagraphify wrapper`, `warmup`, `record`, and `checkpoint`.
Test Plan:
1. buck2 run fbcode//mode/opt //pytorch/benchmark:run -- resnet50 -d cuda -t train --inductor --pt2-triton-cudagraph
2. Found the result in [scuba table](https://fburl.com/scuba/pt2_compile_events/0oik8nu9).
{F1954034920}
Differential Revision: D65505659
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139818
Approved by: https://github.com/eellison
Summary:
AMD lowering duration is 1.55x longer than H100. Profiling shows hipification related functions took 22% of overall lowering time.
This diff cuts that time by safely memoizing the trie-to-regex logic. The trick is to incrementally build a state of the trie during trie construction. The state is the hash of all the words added to the trie.
Differential Revision: D65659445
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140156
Approved by: https://github.com/ColinPeppler
Co-authored-by: Kefei Lu <kefeilu@meta.com>
Fixes #126268
I've basically followed @ezyang's suggestion (I think) to use `func.decompose(...)`. Since `__torch_dispatch__` won't be called a second time for the same op, I've added a second `TorchDispatchMode` (`_DecomposedCounterMode`) that simply dispatches to the parent flop counter. Using `self` as the inner context manager is not possible, since the second call to `__enter__` would re-initialize the counter's tracking state.
Let me know if there's something wrong with this implementation, since I'm quite unsure how the decomposition thing actually works :D
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138508
Approved by: https://github.com/ezyang
This fix was a bit more involved:
1) It fixes a place where item_memo was lost.
2) It updates a test to be eager instead of aot_eager since it reveals a very obscure bug related to replacements that's not worth solving since in practice inductor will regenerate the runtime asserts anyways
3) It updates tensorify to specialize more places now that the aforementioned bug is fixed.
Fixes `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=6 python test/inductor/test_torchinductor_opinfo.py TestInductorOpInfoCPU.test_comprehensive_linalg_norm_cpu_float16` when `specialize_float=False`
while ensuring `python test/dynamo/test_dynamic_shapes.py DynamicShapesMiscTests.test_runtime_assert_replacement_dynamic_shapes` doesn't regress
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139587
Approved by: https://github.com/ezyang
ghstack dependencies: #139569, #139457, #139568, #139572, #139846, #139454, #139896, #139935
Currently, we get all partition ids by iterating over `assignment`, whose size equals the number of nodes in the graph. We can reach the same result by iterating over `partitions_by_id`, whose size is much smaller than the number of nodes. Assuming the number of nodes is N and the number of partitions is P, the time complexity decreases from O(N * N) to O(N * P) after this patch.
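A toy illustration of why the two sources yield the same ids. The dictionary shapes below are assumptions made for the example, not the partitioner's actual data structures.

```python
# assignment: node -> partition id (one entry per graph node, N entries).
assignment = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 1, "f": 2}

# partitions_by_id: partition id -> nodes (one entry per partition, P entries).
partitions_by_id = {0: {"a", "b"}, 1: {"c", "d", "e"}, 2: {"f"}}

# Before: scan all N assignments to collect the distinct ids.
ids_from_assignment = set(assignment.values())

# After: read the P keys directly.
ids_from_partitions = set(partitions_by_id)
```

Both sets are equal, but the second is built from P entries instead of N, which is where the saving comes from when this lookup happens once per node.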
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136598
Approved by: https://github.com/ezyang
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
This PR contains several fixes related to non-contiguous NJTs:
1. Propagates `lengths` through op calls appropriately (see desc of #138098)
* SDPA now calls `nested_view_from_values_offsets_lengths()` instead of `nested_view_from_values_offsets()`
2. Allows non-contig NJTs in unsqueeze / transpose / select
3. Expands padded dense -> NJT conversion to support non-contig NJTs
4. (unrelated sorry) Updates `split` / `split_with_sizes` to allow for optional `dim`, matching the ATen signature
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140160
Approved by: https://github.com/cpuhrsch
Fixes #136559
As we upgrade to NumPy 2, torch incorrectly filters out `numpy.random` as unsupported in Dynamo tracing.
This PR changes the filtering rules to include them while keeping behavior with numpy 1 unchanged.
Before this PR, the following tests failed:
```
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/dynamo/test_functions.py -k FunctionTests.test_numpy_random
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/dynamo/test_unspec.py -k UnspecTests.test_to_tensor
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/test_fake_tensor.py -k FakeTensorTest.test_export_numpy
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/test_fake_tensor.py -k PropagateRealTensorsFakeTensorTest.test_export_numpy_propagate_real_tensors
```
With this PR, the supported/unsupported ops in NumPy 1 are not changed.
For NumPy 2, only the `numpy.random` ops that are already supported with NumPy 1 are added to the supported list.
I used the following scripts to check the differences before and after the change for both NumPy 1 & 2.
The output is empty for NumPy 1 since there is no change.
For NumPy 2, the output is the list of `numpy.random` functions that are now considered supported.
```py
from torch._dynamo import trace_rules
import numpy as np

def new_numpy_function_ids():
    unsupported_funcs = {"seed", "ranf", "get_bit_generator", "RandomState", "set_bit_generator", "sample"}

    def is_supported(k, v, mod):
        if not callable(v):
            return False
        if not getattr(v, "__module__", None):
            return True
        if v.__module__ == mod.__name__:
            return True
        if v.__module__ == "numpy.random.mtrand" and mod.__name__ == "numpy.random" and k not in unsupported_funcs:
            return True
        return False

    rv = {}
    for mod in trace_rules.NP_SUPPORTED_MODULES:
        for k, v in mod.__dict__.items():
            if is_supported(k, v, mod):
                rv[id(v)] = f"{mod.__name__}.{k}"
    return rv

def old_numpy_function_ids():
    rv = {}
    for mod in trace_rules.NP_SUPPORTED_MODULES:
        rv.update(
            {
                id(v): f"{mod.__name__}.{k}"
                for k, v in mod.__dict__.items()
                if callable(v)
                and (getattr(v, "__module__", None) or mod.__name__) == mod.__name__
            }
        )
    return rv

rv1 = set(old_numpy_function_ids().values())
rv2 = set(new_numpy_function_ids().values())
for v in (rv1 - rv2):
    print(v)
print("****")
for v in (rv2 - rv1):
    print(v)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138686
Approved by: https://github.com/williamwen42
This PR introduces the following:
### torch.ops.symm_mem._async_input_mm
`_async_input_mm(Tensor a, Tensor b, Tensor a_chunk_signals, int a_chunk_pivot) -> Tensor`
An mm impl that supports consuming asynchronous input. It guarantees the following rasterization order, and that the corresponding signal arrives before an input chunk is consumed.
```
num_chunks = a_chunk_signals.numel()
for chunk_idx in range(a_chunk_pivot, num_chunks + a_chunk_pivot):
    chunk_idx = chunk_idx % num_chunks
    wait_signal(a_chunk_signals, chunk_idx)
    # Compute output tiles that consume the input chunk
```
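To make the rasterization order concrete, here is a small sketch that only computes the order in which chunks are visited. The helper name `chunk_visit_order` is hypothetical; the real op waits on device-side signals rather than returning a list.

```python
def chunk_visit_order(num_chunks, a_chunk_pivot):
    """Return the chunk indices in the order the loop above visits them."""
    return [
        chunk_idx % num_chunks
        for chunk_idx in range(a_chunk_pivot, num_chunks + a_chunk_pivot)
    ]


order_from_0 = chunk_visit_order(4, 0)
order_from_1 = chunk_visit_order(4, 1)
```

With a pivot of 1 and four chunks, the visit order starts at chunk 1 and wraps around to chunk 0 last.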
### PersistentAsyncInputScheduler
This is a forked version of PersistentScheduler that supports consuming asynchronous input. This tile scheduler introduces the following arguments:
- `tiles_per_chunk_m` – Specifies the size of an M chunk. Chunks are the granularity at which the asynchronous input becomes ready. It must be an integer multiple of the size of an M tile.
- `chunk_signals` – `chunk_signals[i] == 1` indicates that chunk i is ready. Before returning a work tile, get_current_work() waits for the signal to ensure that the corresponding chunk is ready.
- `tile_idx_pivot_m` – After applying swizzling, apply `pivot(m) => (m + tile_idx_pivot_m) % tiles_m` to `m`. In a distributed setting, this allows different ranks to process different m indices at the same time, thus avoiding communication hotspots.
Note that this scheduler currently only supports the `KernelTmaWarpSpecializedCooperative` kernel schedule. This is enforced via the template argument `KernelSchedule`.
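The pivot mapping can be illustrated in a few lines. This is a sketch only: the choice of giving each rank an offset of `rank * (tiles_m // world_size)` is an assumption made for illustration, not something the scheduler prescribes.

```python
def pivot(m, tile_idx_pivot_m, tiles_m):
    """Rotate the M tile index by the pivot, wrapping around at tiles_m."""
    return (m + tile_idx_pivot_m) % tiles_m


tiles_m = 8
world_size = 4

# Hypothetical per-rank pivots: spread the starting tile evenly across ranks.
first_tile_per_rank = [
    pivot(0, rank * (tiles_m // world_size), tiles_m) for rank in range(world_size)
]

# For a fixed pivot, the mapping is a permutation of the tile indices.
rotated = [pivot(m, 3, tiles_m) for m in range(tiles_m)]
```

Each rank starts from a different M tile, which is the property that avoids all ranks hitting the same indices at the same time.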
Usage:
```
using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
    Shape<int, int, int, int>,
    CollectiveMainloop,
    CollectiveEpilogue,
    cutlass::gemm::PersistentAsyncInputScheduler<KernelSchedule>>;
```
### _fused_all_gather_matmul_native
An ag-mm impl that combines `torch.ops.symm_mem._async_input_mm` and progress-aware all-gather. This is not yet enabled via the async-tp passes. We will use it as a backend to optimize the current decomposition-based async-tp impl.
## Benchmarks
### 4096x3584x8192
- cublas + nccl: 539us
- decomp-based async-tp w/o cuda graph: 694us
- decomp-based async-tp w/ cuda graph: 478us
- new cutlass kernel: 408us
<img width="478" alt="image" src="https://github.com/user-attachments/assets/39f316ab-36c5-4b41-af77-07854a385dfc">
### 2048x3584x8192
- cublas + nccl: 301us
- decomp-based async-tp w/o cuda graph: 687us
- decomp-based async-tp w/ cuda graph: 356us
- new cutlass kernel: 276us
<img width="441" alt="image" src="https://github.com/user-attachments/assets/9e23ce21-863b-43dd-a562-fb05d3a5a144">
## Next Steps
- Add tuning logic
- Use `_fused_all_gather_matmul_native` as a backend for the decomp-based async-tp impl
Differential temp Revision: [D65623152](https://our.internmc.facebook.com/intern/diff/D65623152)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139227
Approved by: https://github.com/weifengpy, https://github.com/Chillee
`manylinuxaarch64-builder` already comes pre-built with all versions of the Python runtime.
Refactor the logic for setting the path to DESIRED_PYTHON from `manywheel/build_common` into `set_desired_python.sh` and call it from `aarch64_ci_setup.sh`.
In follow-up PRs, move the scons and ninja installation into the base docker image.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140093
Approved by: https://github.com/atalman
When we have hardware support, we can use it. When we don't have hardware support, we can still do better than vec_base.h. I'm not sure to what extent we're set up to properly test both `defined(__ARM_FEATURE_BF16)` and `!defined(__ARM_FEATURE_BF16)` builds, feedback especially welcome there.
Testing: vec_test_all_types should cover correctness. For perf, seems clear that using vectorized intrinsics should be better than vec_base?
Differential Revision: [D64997747](https://our.internmc.facebook.com/intern/diff/D64997747/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139090
Approved by: https://github.com/jgong5, https://github.com/malfet
ghstack dependencies: #139084
Summary: Tighten the AOTIModelContainerRunner::run interface to take a const vector of at::Tensor, which 1) makes it clear that the runner will not modify the input tensor vector; 2) runner will be able to take a temp vector of tensors as the input.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139955
Approved by: https://github.com/chenyang78
Previously, the split decomp would return the input when there were no splits. This errors in torch.compile (or FakeTensorMode) with:
> RuntimeError: View operation returned a tensor that is the same as the input base tensor. This is no longer allowed; you must explicitly create a new tensor (e.g., using .detach()). As a user, you could have made a mistake implementing __torch_dispatch__ or a Python operator decomposition or meta registration; if that's not the case, please report a bug to PyTorch or the backend you are using.
Fix for https://github.com/pytorch/pytorch/issues/133394
Differential Revision: [D65635070](https://our.internmc.facebook.com/intern/diff/D65635070)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140065
Approved by: https://github.com/bdhirsh
This PR adds support for the `restore_value` argument of the
`@triton.autotune` for the user-defined Triton kernels in PT2.
The `kernel.restore_idx` are extracted in the
`ir.UserDefinedTritonKernel` and the corresponding arg names are
placed into the `triton_meta["restore_value"]`. From there, those
are added to the existing `mutated_arg_names` in the caching autotuner
infra, which already exists and leads to the listed args being cloned.
This achieves the equivalent effect to the native `restore_value`.
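The effect of cloning the listed args can be sketched in plain Python. Everything here is a toy model: `autotune`, `inplace_add_one`, and the dict-of-args calling convention are hypothetical and only illustrate why cloning mutated arguments during benchmarking leaves the caller's buffers untouched.

```python
import copy


def autotune(kernel, args, mutated_arg_names, num_configs=3):
    """Benchmark `kernel` several times, cloning the arguments it mutates."""
    for _ in range(num_configs):
        trial_args = {
            name: copy.deepcopy(value) if name in mutated_arg_names else value
            for name, value in args.items()
        }
        kernel(**trial_args)


def inplace_add_one(buf, n):
    # Toy "kernel" that mutates its first argument in place.
    for i in range(n):
        buf[i] += 1


args = {"buf": [0, 0, 0], "n": 3}
autotune(inplace_add_one, args, mutated_arg_names={"buf"})
buf_after_tuning = list(args["buf"])

inplace_add_one(**args)  # the single "real" launch after tuning
buf_after_launch = list(args["buf"])
```

Without the clone, the buffer would have been incremented once per benchmarked config before the real launch.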
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139851
Approved by: https://github.com/oulgen
This is a first step towards removing the build's dependency on conda.
Currently we build magma as a conda package in a pytorch conda channel, implemented in a1b372dbda/magma.
This commit adapts the logic from pytorch/builder as follows:
- use pytorch/manylinux-cuda<cuda-version> as base image
- apply patches and invoke the build.sh script directly (no longer through conda build)
- store license and build files along with the built artifact, in an info subfolder
- create a tarball file which resembles that created by conda, without any conda-specific metadata
A new matrix workflow is added, which runs the build for each supported cuda version and uploads the binaries to the pytorch s3 bucket.
For the upload, define an upload.sh script, which will be used by the magma windows job as well, to upload to `s3://ossci-*` buckets.
The build runs on PR and push, upload runs in DRY_RUN mode in case of PR.
Fixes #139397
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139888
Approved by: https://github.com/atalman, https://github.com/malfet, https://github.com/seemethere
1. My company is using privateuseone to connect a new hardware device and requires the `batch_isend_irecv` function. However, `batch_isend_irecv` is currently only open to CUDA, so I add a `supports_coalescing` property in `c10d::Backend` to determine whether the backend supports coalescing.
2. If `pg._has_hooks` returns True, we don't need to determine whether the current device is CUDA, so privateuseone can also support `pg._wait_for_pending_works`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135338
Approved by: https://github.com/kwen2501
Here are the cases where Inductor does autotuning at compile time:
1. pad mm: benchmark to decide if we should pad or not
2. template autotuning: benchmark triton/cutlass templates and ATen kernel for matmul/conv and pick the fastest one.
This PR annotates these cases with `dynamo_timed`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139431
Approved by: https://github.com/ezyang
**About the PR**
In the implementation of SmoothQuant in Torchao, quantized linear is computed by `_int_mm(a, b)` + `mul(b_scale)` + `mul(a_scale)` (+ optional `add` for bias) with `reshape` and `convert_dtype` in between.
This PR adds a pass to fuse the corresponding patterns:
- (no bias) `reshape -> _int_mm -> convert_element_type -> (expand -> mul) -> mul -> reshape`
- (with bias) `pattern_no_bias -> add -> reshape -> reshape`
The patterns are replaced by `onednn.qlinear_pointwise` and `onednn.qlinear_prepack`, the latter of which is evaluated and frozen during the freezing process of Inductor. The final graph contains `onednn.qlinear_pointwise` only with packed weight constants.
Note that `onednn.qlinear_pointwise` does not support per-channel quantization of activation, which is a limitation of oneDNN library, so in that case we set activation scale to 1 and bias to none and apply scales and add bias after `onednn.qlinear_pointwise`.
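As a rough illustration of the unfused computation that the patterns match, the following pure-Python sketch applies an integer matmul followed by the two scale multiplications and an optional bias. The scale layout (one `a_scale` per activation row, one `b_scale` per output column) and all function names are assumptions of this sketch, not the Torchao or oneDNN implementation.

```python
def int_mm(a, b):
    """Integer matmul on nested lists: (M x K) @ (K x N)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def quantized_linear(a_q, b_q, a_scale, b_scale, bias=None):
    # _int_mm(a, b) -> mul(b_scale) -> mul(a_scale) -> optional add(bias)
    acc = int_mm(a_q, b_q)
    out = [
        [acc[i][j] * b_scale[j] * a_scale[i] for j in range(len(b_scale))]
        for i in range(len(a_scale))
    ]
    if bias is not None:
        out = [[v + bias[j] for j, v in enumerate(row)] for row in out]
    return out


a_q = [[1, 2], [3, 4]]
b_q = [[5, 6], [7, 8]]
no_bias = quantized_linear(a_q, b_q, a_scale=[0.5, 0.25], b_scale=[2.0, 4.0])
with_bias = quantized_linear(
    a_q, b_q, a_scale=[0.5, 0.25], b_scale=[2.0, 4.0], bias=[1.0, -1.0]
)
```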
**Validation results**
Accuracy/perplexity is not changed with or without this fusion pass.
Latency is improved by >10% with the fusion pass.
Test method:
- Model: EleutherAI/gpt-j-6b
- Hardware: Intel(R) Xeon(R) Platinum 8490H, running on 1 socket, 60 cores
- Using Intel OMP and Tcmalloc
- Running [the example script of SmoothQuant in Torchao](https://github.com/pytorch/ao/blob/main/torchao/prototype/smoothquant/example.py) with `TORCHINDUCTOR_FREEZING=1 numactl -N1 python example.py -m EleutherAI/gpt-j-6b --device=cpu --quant-mode=dynamic --compile`
**Test plan**
```
python test/inductor/test_mkldnn_pattern_matcher.py -k test_smooth_quant_with_int_mm
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139595
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jerryzh168
Test uses `torch.load()` for DTensor state_dict:
```
python3 test/distributed/fsdp/test_fsdp_dtensor_state_dict.py -k TestFSDPWithDeviceMeshAndDTensor
```
In this PR, we add `DTensor`-related classes to the allowed safe globals so we can still `torch.load()` a `DTensor` with `weights_only=True`. We also need this for backward compatibility, since a `DTensor` could be loaded with `torch.load()` before `weights_only` defaulted to True. Without the change, calling `torch.load()` on a `DTensor` would run into the following error:
```
_pickle.UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options, do those steps only if you trust the source of the checkpoint.
(1) Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
(2) Alternatively, to load with `weights_only=True` please check the recommended steps in the following error message.
WeightsUnpickler error: Unsupported global: GLOBAL torch.distributed.tensor.DTensor was not an allowed global by default. Please use `torch.serialization.add_safe_globals([DTensor])` or the `torch.serialization.safe_globals([DTensor])` context manager to allowlist this global if you trust this class/function.
```
The unit test failure was not captured by CI when `weights_only` was being rolled out for `torch.load()` by default. This is due to another issue: the test communication wrapper `with_comms` lets unit tests silently pass without capturing failures due to a recent change (https://github.com/pytorch/pytorch/pull/138108). This wrapper issue is going to be fixed by a separate PR https://github.com/pytorch/pytorch/pull/139637.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139949
Approved by: https://github.com/mikaylagawarecki
At line 205, I believe the code `x = self.activations[act](x)` should be indented so that it is in the body of the for loop. Otherwise, applying the four linear modules has the same effect as applying a single linear module, in the sense that it is still just a linear map so there is no point in having four of them. In other words, each layer of this network should have a nonlinearity.
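The argument can be checked with a tiny scalar example. This is a sketch with made-up weights and a ReLU standing in for whichever activation the original code uses: composing four affine layers without a nonlinearity is exactly one affine map.

```python
# Four scalar "linear layers", each (weight, bias). Values are arbitrary.
params = [(2.0, 1.0), (-1.0, 3.0), (0.5, 0.0), (4.0, -2.0)]


def forward_without_activation(x):
    for w, b in params:
        x = w * x + b
    return x


def forward_with_activation(x):
    for w, b in params:
        x = max(w * x + b, 0.0)  # activation applied inside the loop
    return x


# Collapse the four layers into a single affine map w_total * x + b_total.
w_total, b_total = 1.0, 0.0
for w, b in params:
    w_total, b_total = w * w_total, w * b_total + b
```

Here the four layers collapse to `-4.0 * x + 2.0`, whereas the version with the activation inside the loop does not reduce to any single affine map.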
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139667
Approved by: https://github.com/malfet
Reverts PR https://github.com/pytorch/pytorch/pull/137523
Reasons for the reversion:
1. NCCL profiler plugin is meant to be opened by NCCL. And the profiler's implementation is meant to be provided by a profiler. There is no evidence that `torch.distributed` is at a better position to be either an opener or a provider. (The PR to be reverted made `torch.distributed` an opener).
2. The main purpose of the reverted PR is to dlopen a dump function, with the help of an environment variable `NCCL_PROFILER_PLUGIN_FUN` that provides the symbol name, as in code below:
c19c384690/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (L415-L427)
After some investigation, NCCL does not support env var `NCCL_PROFILER_PLUGIN_FUN`. And NCCL's profiler contract `nccl_profiler.h` does not have a function called "ncclProfilerPluginDump" defined. So this looks like a private add-on.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139847
Approved by: https://github.com/c-p-i-o
Summary: I'm refactoring dynamo_timed and updating the params. It will be much easier to do this refactor entirely in OSS. So this diff essentially provides a couple aliases in the OSS area that I can update without affecting the internal usage.
Test Plan: Ran locally and made sure I still got samples: https://fburl.com/scuba/dynamo_compile/sandbox/qub89lwj
Reviewed By: oulgen
Differential Revision: D65580302
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140016
Approved by: https://github.com/oulgen
My sales pitch: I need to ssh into the runner from time to time on my PR to debug issues, but it's well-known that LF runners don't support SSH login anymore. So, the proposed fix here is to introduce a new label called ~no-runner-determinator~ `no-runner-experiments` that can be attached to the PR. Whenever `.github/scripts/runner_determinator.py` runs on a PR and sees this label, it will not apply any logic and just straight up use an empty prefix.
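The opt-out behaviour amounts to an early return. The sketch below uses hypothetical names (`get_runner_prefix`, `compute_prefix`) and is not the actual code in `runner_determinator.py`.

```python
OPT_OUT_LABEL = "no-runner-experiments"


def get_runner_prefix(pr_labels, compute_prefix):
    """Return the runner prefix, skipping all experiment logic when opted out."""
    if OPT_OUT_LABEL in pr_labels:
        return ""  # empty prefix: use the default runners
    return compute_prefix()


opted_out = get_runner_prefix({"no-runner-experiments"}, lambda: "lf.")
regular = get_runner_prefix(set(), lambda: "lf.")
```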
### Testing
With the label:
```
python3 runner_determinator.py \
--github-token "MY_TOKEN" \
--github-issue "5132" \
--github-branch "install-torchao-torchtune-et" \
--github-actor "huydhn" \
--github-issue-owner "huydhn" \
--github-ref-type "branch" \
--github-repo "pytorch/pytorch" \
--eligible-experiments "" \
--pr-number "139947"
INFO : Opt-out runner determinator because #139947 has no-runner-determinator label
WARNING : No env var found for GITHUB_OUTPUT, you must be running this code locally. Falling back to the deprecated print method.
::set-output name=label-type::
```
Without the label:
```
python3 runner_determinator.py \
--github-token "MY_TOKEN" \
--github-issue "5132" \
--github-branch "install-torchao-torchtune-et" \
--github-actor "huydhn" \
--github-issue-owner "huydhn" \
--github-ref-type "branch" \
--github-repo "pytorch/pytorch" \
--eligible-experiments "" \
--pr-number "139947"
INFO : Based on rollout percentage of 95%, enabling experiment lf.
INFO : Skipping experiment 'awsa100', as it is not a default experiment
WARNING : No env var found for GITHUB_OUTPUT, you must be running this code locally. Falling back to the deprecated print method.
::set-output name=label-type::lf.
```
Running in trunk commit without a PR number will use the regular logic:
```
python3 runner_determinator.py \
--github-token "MY_TOKEN" \
--github-issue "5132" \
--github-branch "install-torchao-torchtune-et" \
--github-actor "huydhn" \
--github-issue-owner "huydhn" \
--github-ref-type "branch" \
--github-repo "pytorch/pytorch" \
--eligible-experiments "" \
--pr-number ""
INFO : Based on rollout percentage of 95%, enabling experiment lf.
INFO : Skipping experiment 'awsa100', as it is not a default experiment
WARNING : No env var found for GITHUB_OUTPUT, you must be running this code locally. Falling back to the deprecated print method.
::set-output name=label-type::lf.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140054
Approved by: https://github.com/malfet, https://github.com/ZainRizvi
Summary:
In most cases, we don't need to turn on AttrProxy tracing for two reasons:
1. It's only needed when you have one submodule owning multiple FQNs.
2. AND it will cause models that use module identity to be traced incorrectly (because we substitute module objects at tracing time).
Overall, after offline discussion with some export folks, we think it's better to turn off AttrProxy if we can make sure every submodule has a unique FQN, which tends to be the common case.
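The "every submodule has a unique FQN" condition could be checked along these lines. This is a sketch over plain `(fqn, module)` pairs with a hypothetical helper name; it does not reflect how export actually performs the check.

```python
def has_unique_fqns(named_modules):
    """True if no module object is registered under more than one FQN."""
    seen = set()
    for _fqn, module in named_modules:
        if id(module) in seen:
            return False
        seen.add(id(module))
    return True


class Block:
    pass


shared = Block()
unique_case = [("layers.0", Block()), ("layers.1", Block())]
shared_case = [("encoder.block", shared), ("decoder.block", shared)]

unique_ok = has_unique_fqns(unique_case)
shared_ok = has_unique_fqns(shared_case)
```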
Test Plan: buck test mode/opt caffe2/test:test_export -- -r module_dict_key
Differential Revision: D65555919
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139918
Approved by: https://github.com/tugsbayasgalan
I recently added a new pattern here https://github.com/pytorch/pytorch/pull/139136 to remove pointless view/permute pairs. In that PR, I already updated the matched pattern/node count in `test_linear_binary` to account for the new pattern. But it looks like with cpp wrapper, one more pattern will be matched.
```
7 patterns without cpp-wrapper:
========== pattern matched <code object pointless_view at 0x7f6d25c67aa0, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/joint_graph.py", line 568> =======
========== pattern matched <code object pointless_view_pair at 0x7f6d25c67b50, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/joint_graph.py", line 581> =======
========== pattern matched <code object pointless_view at 0x7f6d25c67aa0, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/joint_graph.py", line 568> =======
========== pattern matched <code object pointless_view at 0x7f6d25c67aa0, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/joint_graph.py", line 568> =======
========== pattern matched <code object linear at 0x7f6d176e5dc0, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/mkldnn_fusion.py", line 1121> =======
========== pattern matched <code object reshape_linear_reshape_pattern at 0x7f6d176e5210, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/mkldnn_fusion.py", line 732> =======
========== pattern matched <code object fn at 0x7f6d176d3ec0, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/mkldnn_fusion.py", line 476> =======
8 patterns with cpp wrapper:
========== pattern matched <code object pointless_view at 0x7f8e78bf07c0, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/joint_graph.py", line 568> =======
========== pattern matched <code object pointless_view_pair at 0x7f8e78bf0870, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/joint_graph.py", line 581> =======
========== pattern matched <code object pointless_view at 0x7f8e78bf07c0, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/joint_graph.py", line 568> =======
========== pattern matched <code object pointless_view at 0x7f8e78bf07c0, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/joint_graph.py", line 568> =======
========== pattern matched <code object pointless_view at 0x7f8e78bf07c0, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/joint_graph.py", line 568> =======
========== pattern matched <code object linear at 0x7f8e59c04190, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/mkldnn_fusion.py", line 1121> =======
========== pattern matched <code object reshape_linear_reshape_pattern at 0x7f8e59dfb520, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/mkldnn_fusion.py", line 732> =======
========== pattern matched <code object fn at 0x7f8e59dfa290, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/mkldnn_fusion.py", line 476> =======
```
I fixed this test by adding 1 to the expected number if cpp wrapper is enabled. But fundamentally, can we avoid asserting on the total number of patterns matched in the test? I think that makes the test very fragile: people adding new patterns may keep breaking these 'unrelated' tests. One possible improvement is to keep a counter for each specific pattern and, in the tests, check the match count only for the ***RELEVANT*** patterns instead of the total number of patterns matched. That should reduce false positives for broken tests. cc possible test creator @jgong5
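A per-pattern counter could look roughly like this. It is a sketch using `collections.Counter` and a hypothetical `record_match` hook, fed with the pattern names from the cpp-wrapper log above; it is not Inductor's existing counter infrastructure.

```python
from collections import Counter

match_counts = Counter()


def record_match(pattern_name):
    """Hypothetical hook called once per successful pattern match."""
    match_counts[pattern_name] += 1


# Matches observed in the cpp-wrapper run above.
for name in [
    "pointless_view",
    "pointless_view_pair",
    "pointless_view",
    "pointless_view",
    "pointless_view",
    "linear",
    "reshape_linear_reshape_pattern",
    "fn",
]:
    record_match(name)

total_matches = sum(match_counts.values())
```

A test for the linear fusion would then assert only on `match_counts["linear"]` and `match_counts["reshape_linear_reshape_pattern"]`, and would be unaffected by how many `pointless_view` matches occur.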
Fixes https://github.com/pytorch/pytorch/issues/139812 (we need to have this to run this disabled test on your PR)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139942
Approved by: https://github.com/huydhn, https://github.com/jgong5
Fixes #139755, #139621
The new stream pipeliner on the AMD triton backend enables num_stages to function equivalently to the NV backend. This upgrade in triton 3.2 will cause OOM issues in flex attention due to the num_stages=3 setting, so we have tuned this to num_stages=1, which is the best setting for flash attention kernels and avoids the shmem issues.
We will follow up this PR with some config tuning on AMD backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139883
Approved by: https://github.com/bertmaher
Fixes `PYTORCH_TEST_WITH_DYNAMO=1 python test/test_torch.py TestTorchDeviceTypeCPU.test_gradient_type_promotion_cpu` when `specialize_float=False`
Reviewers might wonder why we need to have this whitelist. Can't we rely on python_arg_parser.h to do the specialization generically? Alas this path doesn't actually FFI to C++ so we do need to do the specialization in pythonland.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139935
Approved by: https://github.com/ezyang
ghstack dependencies: #139569, #139457, #139568, #139572, #139846, #139454, #139896
Instead of moving these queries to ClickHouse, we're just going to remove them since they're not really used. We do want something for test aggregates, but we can make a new script instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139915
Approved by: https://github.com/huydhn
Set PYTORCH_MIOPEN_SUGGEST_NHWC environment variable to force output layout to channels-last.
This way, the channels-last CK instances will be added to benchmark choices in max autotune
# Testing
```
pytest test/inductor/test_ck_backend.py -k conv2d
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138643
Approved by: https://github.com/chenyang78
The shell script was still referencing the builder checkout rather than PyTorch, which resulted in
```
python /builder/aarch64_linux/aarch64_wheel_ci_build.py --enable-mkldnn
python: can't open file '/builder/aarch64_linux/aarch64_wheel_ci_build.py': [Errno 2] No such file or directory
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140020
Approved by: https://github.com/atalman
The test was failing when I ran the whole test suite. I'm guessing that the exact indices would previously depend on the order that tests would run; by resetting the kernel_side_table we should hopefully get results that are reproducible independent of the test execution order.
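The order dependence can be reproduced with a toy table. The `SideTable` class below is a hypothetical stand-in and does not mirror the real `kernel_side_table` API; it only shows why indices handed out by a process-wide table depend on what ran earlier unless the table is reset.

```python
class SideTable:
    """Toy global table that hands out indices in insertion order."""

    def __init__(self):
        self._items = []

    def add(self, item):
        self._items.append(item)
        return len(self._items) - 1

    def reset(self):
        self._items.clear()


table = SideTable()

# An earlier "test" registers a kernel first, shifting later indices.
table.add("kernel_from_other_test")
idx_without_reset = table.add("my_kernel")

# Resetting at the start of the test makes the index independent of order.
table.reset()
idx_with_reset = table.add("my_kernel")
```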
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139907
Approved by: https://github.com/oulgen, https://github.com/aakhundov
- Refactored traceback code into `work.printTraceback()`. cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @shuqiangzhang
- Refactored desync debug code into `class DesyncDebugger`.
- Moved occurrences of `futureWorkResult_->markCompleted` into `checkAndSetException` and `checkTimeout`, respectively. cc @shuqiangzhang
- Modularized dump signal broadcast code into `ProcessGroupNCCL::broadcastDumpSignal`. cc @fduwjj @c-p-i-o
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139834
Approved by: https://github.com/shuqiangzhang
Based on discussion here: https://github.com/pytorch/pytorch/pull/138731
Introduces the ability for a subclass to implement type conversion to `expected_type`.
```
def __coerce_same_metadata_as_tangent__(
    self, expected_metadata: Any, expected_type: Optional[Type] = None
):
```
Here `expected_type=None` means `SubclassClass` is expected.
E.g. for `DTensor` we may find an `AsyncCollectiveTensor` tangent where we expected a `Tensor`; in this case the method will be called with
`expected_type=Tensor` at runtime.
Adds an implementation to AsyncCollectiveTensor that just triggers `wait()`.
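A toy model of the hook, for illustration only. `ToyAsyncTensor` is hypothetical, `float` plays the role of `Tensor`, and returning `None` for an unsupported `expected_type` is an assumption of this sketch rather than documented behaviour.

```python
from typing import Any, Optional, Type


class ToyAsyncTensor:
    """Toy stand-in for an async wrapper; `float` plays the role of `Tensor`."""

    def __init__(self, value: float):
        self._value = value
        self.waited = False

    def wait(self) -> float:
        # A real wrapper would block on the pending collective here.
        self.waited = True
        return self._value

    def __coerce_same_metadata_as_tangent__(
        self, expected_metadata: Any, expected_type: Optional[Type] = None
    ):
        if expected_type is None:
            # The subclass itself is expected: nothing to convert.
            return self
        if expected_type is float:
            # A plain value is expected: resolve the pending result.
            return self.wait()
        return None  # assumption of this sketch: signal "cannot convert"


t = ToyAsyncTensor(3.0)
plain = t.__coerce_same_metadata_as_tangent__(None, expected_type=float)

u = ToyAsyncTensor(1.0)
same = u.__coerce_same_metadata_as_tangent__(None)
```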
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139095
Approved by: https://github.com/bdhirsh
@frost-intel discovered that some Inductor auto-tuning UTs for CPU are currently broken on machines supporting AMX ISA. That's because in #136688, I had reverted a change in the AMX GEMM micro-kernel that was introduced in #131887, but it looks like some other implementations introduced after the aforementioned change rely upon it, so it should not have been reverted.
Added a fix.
Ideally, a CI machine that supports AMX should cover these UTs (test/inductor/test_cpu_select_algorithm.py). We do have at least one CI machine that supports AMX.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139906
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
Support stream. When the driver communicates with the executor, it will send the stream id corresponding to the execution command; when the executor receives the command with the stream id, it will ignore the stream id because cpu backend doesn't support asynchronous execution.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136991
Approved by: https://github.com/ezyang
There are 5 parts in this PR (they are hard to break into smaller ones because they're highly coupled):
1. **Whenever we call create_graph_input, we try to bind the symbols in the graph input.**
Since we've enforced the invariant that all create_graph_input calls must provide an example value, we can intercept at the create_graph_input calls (this PR only handles free symbols in tensors).
2. **We cache the bound_symbols** to avoid lifting the same symbol repeatedly.
3. For lifted symbols, we re-use **lifted_freevars**, i.e. the mapping from a symbol's proxy in the parent graph to the lifted phs in the current subgraph, which is how we already handle lifted tensors. In this way, all hops that support lifted tensors should be able to handle lifted symints automatically (at least in the dynamo part).
4. For **unbacked symbols** created during tracing, we also need to bind these symbols to their proxies. This supports the test cases where we want to lift unbacked symbols as inputs. We need the proxy of the unbacked symbol in the parent graph in order to properly create the args to the hop.
5. We update all the tests now that free symbols are lifted in subgraphs, and also support the lifted symbols in existing higher order ops.
**The interaction of nested tracers:**
The previous design for lifting tensor closures is: suppose we're in nested tracers; whenever we see a new proxy that was not created by the current tracer, we recursively look for the proxy in the parent tracer until we find the tracer that created it (either as a placeholder or as some intermediate result). More detail is in Note [Nested SubgraphTracer and free_variable handling].
Given the above design, the plan for lifting the free symbols is: whenever we lift a free tensor to be an input of the current subgraph, we look at the symbols in it and bind them at the same time.
For example, suppose we have the following function:
```python
def f(x: [s1, s2]):
    def true_f():
        def true_f_inner():
            return x.sin()
```
What happens, in order:
1. We create subtracer 1 and start to speculate the outer cond's true_f.
2. We create another subtracer 2 and start to speculate the inner cond's true_f_inner.
3. Dynamo realizes the tensor input x by calling wrap_tensor at top level to create graph input x (tracer 0), and we bind the symbols s1 and s2 after the ph for x is created. So the graph now looks like:
```python
def gm(s1, s2, x):
```
4. When seeing TensorVariable.call_method of x, tracer 2 wants to create a call_function(sin, proxy_of_x), but it finds that proxy_of_x was not created by the current tracer. So it recursively looks up its parent tracer 1, finds that tracer 1 doesn't track proxy_of_x either, and then reaches the root tracer 0, which created it and tracks it as a ph. Then tracer 1 calls create_graph_input to lift the closure to its input ph1 and adds the (proxy_of_x: ph1) k-v pair to the **lifted_freevars** of tracer 1.
Now the graph looks like:
```python
def gm(s1, s2, x):
    def true_gm(x):
```
5. Since there are free symbols inside this new tensor input, tracer 1 also binds the symbols (maybe_bind_symbol), which calls create_graph_input for s1 and s2. Now the graph looks like
```python
def gm(s1, s2, x):
    def true_gm(s1, s2, x):
```
6. Then it goes back to tracer 2, which calls create_graph_input for x and gets ph2; tracer 2's **lifted_freevars** records (ph1, ph2). Tracer 2 also binds the symbols in this new tensor input. Now the graph looks like:
```python
def gm(s1, s2, x):
    def true_gm(s1, s2, x):
        def true_gm_inner(s1, s2, x):
```
7. Finally the sin call_function node is created by tracer 2.
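The walkthrough above can be replayed with a toy model. This is a deliberately simplified sketch: `ToyTracer` and its methods are hypothetical, proxies are plain tuples, and symbol inputs are simply listed before tensor inputs to match the signatures shown above. It is not dynamo's SubgraphTracer.

```python
class ToyTracer:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.symbol_inputs = []    # lifted symbols, in binding order
        self.tensor_inputs = []    # lifted tensors, in lifting order
        self.proxies = {}          # tensor name -> proxy owned by this tracer
        self.lifted_freevars = {}  # parent-side proxy -> proxy in this tracer
        self.bound_symbols = {}    # symbol -> proxy in this tracer (cache)

    def signature(self):
        return f"{self.name}({', '.join(self.symbol_inputs + self.tensor_inputs)})"

    def create_graph_input(self, tensor, symbols):
        proxy = (self.name, tensor)
        self.tensor_inputs.append(tensor)
        self.proxies[tensor] = proxy
        for sym in symbols:
            if sym not in self.bound_symbols:  # each symbol is lifted once
                self.symbol_inputs.append(sym)
                self.bound_symbols[sym] = (self.name, sym)
        return proxy

    def lookup(self, tensor, symbols):
        """Find the proxy for `tensor`, lifting it through parents if needed."""
        if tensor in self.proxies:
            return self.proxies[tensor]
        if self.parent is None:
            raise KeyError(tensor)
        parent_proxy = self.parent.lookup(tensor, symbols)
        proxy = self.create_graph_input(tensor, symbols)
        self.lifted_freevars[parent_proxy] = proxy
        return proxy


tracer0 = ToyTracer("gm")
tracer1 = ToyTracer("true_gm", parent=tracer0)
tracer2 = ToyTracer("true_gm_inner", parent=tracer1)

tracer0.create_graph_input("x", ["s1", "s2"])  # step 3
tracer2.lookup("x", ["s1", "s2"])              # steps 4 to 6
signatures = [t.signature() for t in (tracer0, tracer1, tracer2)]
```

Because `bound_symbols` acts as a cache, lifting a second tensor that shares a symbol with `x` would add only the symbols that are not yet bound.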
**This PR also handles the following cases:**
- What if we lift two tensors that share the same symbol, e.g. x1 [s1, s2], x2 [s2, s3]? Each subtracer maintains bound_symbols as a cache that maps a symbol.expr to its proxy in the current tracer. So when we see x1, we track s1 and s2 as inputs and bind s1 to ph1 and s2 to ph2. When we then try to bind the symbols of x2, s2 is already tracked, so no new graph input is created for it.
- What if a subgraph closes over a symint? e.g.
```python
def f(x):
    def true_f():
        c = x.size(0)
        def true_fn_inner():
            return c
```
When we speculate true_fn_inner, we find proxy_of_c is not tracked by tracer 2, so it recursively looks up its parent. At this point, x and its symbols have been lifted as inputs of true_f (as a result of lifting x while tracing true_f in tracer 1). Specifically, the graph looks like:
```python
def gm(s1, s2, x):
    def true_gm(s1, s2, x):
        def true_gm_inner():
```
So tracer 2 is able to find that s1 has been tracked as a ph in tracer 1, so it returns to gm and calls create_graph_input on s1. The graph now looks like:
```python
def gm(s1, s2, x):
    def true_gm(s1, s2, x):
        def true_gm_inner(s1):
            return s1
```
- What if a subgraph closes over an unbacked symint? e.g.
```python
def f(x):
    def true_f():
        c = x.item()
        def true_f_inner():
            return c
```
When x.item() is called, proxy_of_c and its symnode variable are created for tracer 1, and we also call track_unbacked_symbols to record this relationship. So when tracer 2 finds that proxy_of_c was not created by the current tracer, it recursively looks up its parent tracer and finds that the expression u0 has been tracked as a result of track_unbacked_symbol in tracer 1. So it stops the recursion and calls create_graph_input for u0 in tracer 2. The graph looks like:
```python
def f(x):
    def true_f(s1, s2, x):
        c = x.item()
        def true_gm_inner(u0):
            return u0
        cond(pred, true_gm_inner, false_gm_inner, (c,))
```
- What if a subgraph closes over a tensor with an unbacked symint shape?
```python
def f(x):
    def true_f():
        c = x.item()
        r = torch.randn((c,))
        def true_f_inner():
            return r + 1
```
This is the same as the case of closing over tensors with backed shapes: we first lift r, then bind u0 in it, which recursively calls bind_symint on u0 in its parent and finds that u0 is tracked in the parent tracer as a result of the .item() call.
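The lookup rules in the cases above can be sketched with a toy model. This is only an illustration under stated assumptions: `ToyTracer`, `track_symbol`, `bind_symbol`, and `lift_tensor` are hypothetical names, placeholders are plain strings, and none of this is Dynamo's actual tracer code.

```python
class ToyTracer:
    """Toy stand-in for a subgraph tracer; not Dynamo's real class."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.bound_symbols = {}  # symbol -> placeholder name in this tracer
        self.graph_inputs = []   # placeholders, in creation order

    def create_graph_input(self, name):
        ph = f"{self.name}:{name}"
        self.graph_inputs.append(ph)
        return ph

    def track_symbol(self, symbol):
        """Record a symbol that originates in this tracer (e.g. from .item())."""
        self.bound_symbols[symbol] = f"{self.name}:{symbol}"

    def bind_symbol(self, symbol):
        """Return this tracer's placeholder for symbol, lifting it if needed."""
        if symbol in self.bound_symbols:
            # Cache hit: the symbol is already tracked, no new graph input.
            return self.bound_symbols[symbol]
        if self.parent is None:
            raise KeyError(f"{symbol} is not tracked by any tracer")
        # The parent has to know the symbol before we can lift it here.
        self.parent.bind_symbol(symbol)
        ph = self.create_graph_input(symbol)
        self.bound_symbols[symbol] = ph
        return ph

    def lift_tensor(self, name, shape_symbols):
        # Bind the shape symbols first, then create the tensor input.
        for symbol in shape_symbols:
            self.bind_symbol(symbol)
        return self.create_graph_input(name)


root = ToyTracer("gm")
for s in ("s1", "s2", "s3"):
    root.track_symbol(s)

tracer1 = ToyTracer("true_gm", parent=root)
tracer1.lift_tensor("x1", ["s1", "s2"])
tracer1.lift_tensor("x2", ["s2", "s3"])  # s2 is already bound, only s3 is new
tracer1.track_symbol("u0")               # as if x.item() ran in tracer 1

tracer2 = ToyTracer("true_gm_inner", parent=tracer1)
tracer2.bind_symbol("u0")                # found in tracer 1, recursion stops there
```

In this sketch the shared symbol s2 produces a single graph input in `tracer1`, and the unbacked symbol u0 becomes an input of `tracer2` only.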
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138363
Approved by: https://github.com/zou3519
This PR introduces the following:
### torch.ops.symm_mem._async_input_mm
`_async_input_mm(Tensor a, Tensor b, Tensor a_chunk_signals, int a_chunk_pivot) -> Tensor`
An mm impl that supports consuming asynchronous input. It guarantees the following rasterization order, and that the corresponding signal arrives before an input chunk is consumed.
```
num_chunks = a_chunk_signals.numel()
for chunk_idx in range(a_chunk_pivot, num_chunks + a_chunk_pivot):
    chunk_idx = chunk_idx % num_chunks
    wait_signal(a_chunk_signals, chunk_idx)
    # Compute output tiles that consume the input chunk
```
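As a small worked example of the order above, the following pure-Python sketch lists which chunk is consumed at each step. `chunk_visit_order` is a hypothetical helper written for illustration; it only mirrors the index arithmetic and does not wait on any signals.

```python
def chunk_visit_order(num_chunks, a_chunk_pivot):
    """Return the chunk indices in the order the loop above would consume them."""
    return [
        chunk_idx % num_chunks
        for chunk_idx in range(a_chunk_pivot, num_chunks + a_chunk_pivot)
    ]


order = chunk_visit_order(num_chunks=4, a_chunk_pivot=1)
print(order)  # [1, 2, 3, 0]
```

With a pivot of 1, consumption starts at chunk 1 and wraps around to chunk 0 last.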
### PersistentAsyncInputScheduler
This is a forked version of PersistentScheduler that supports consuming asynchronous input. This tile scheduler introduces the following arguments:
- `tiles_per_chunk_m` – Specifies the size of an M chunk. Chunks are the granularity at which the asynchronous input becomes ready. It must be an integer multiple of the size of an M tile.
- `chunk_signals` – `chunk_signals[i] == 1` indicates that chunk i is ready. Before returning a work tile, get_current_work() waits for the signal to ensure that the corresponding chunk is ready.
- `tile_idx_pivot_m` – After applying swizzling, apply `pivot(m) => (m + tile_idx_pivot_m) % tiles_m` to `m`. In a distributed setting, this allows different ranks to process different m indices at the same time, thus avoiding communication hotspots.
Note that this scheduler currently only supports the `KernelTmaWarpSpecializedCooperative` kernel schedule. This is enforced via the template argument `KernelSchedule`.
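To illustrate the `tile_idx_pivot_m` mapping, here is a minimal Python sketch of the formula `pivot(m) => (m + tile_idx_pivot_m) % tiles_m`. The per-rank pivot choice (`rank * (tiles_m // world_size)`) is an assumption made for this example, not something the scheduler prescribes.

```python
def pivot(m, tile_idx_pivot_m, tiles_m):
    """Rotate the M tile index by tile_idx_pivot_m, wrapping at tiles_m."""
    return (m + tile_idx_pivot_m) % tiles_m


tiles_m = 8
world_size = 4

# Hypothetical pivot per rank: spread the starting tiles evenly over M.
first_tile_per_rank = [
    pivot(0, rank * (tiles_m // world_size), tiles_m) for rank in range(world_size)
]
print(first_tile_per_rank)  # [0, 2, 4, 6]
```

Under this assumed pivot choice each rank starts on a different M index, which is the hotspot-avoidance effect described above. The mapping is a rotation, so every tile is still visited exactly once.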
Usage:
```
using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
    Shape<int, int, int, int>,
    CollectiveMainloop,
    CollectiveEpilogue,
    cutlass::gemm::PersistentAsyncInputScheduler<KernelSchedule>>;
```
### _fused_all_gather_matmul_native
An ag-mm impl that combines `torch.ops.symm_mem._async_input_mm` and progress-aware all-gather. This is not yet enabled via the async-tp passes. We will use it as a backend to optimize the current decomposition-based async-tp impl.
## Benchmarks
### 4096x3584x8192
- cublas + nccl: 539us
- decomp-based async-tp w/o cuda graph: 694us
- decomp-based async-tp w/ cuda graph: 478us
- new cutlass kernel: 408us
<img width="478" alt="image" src="https://github.com/user-attachments/assets/39f316ab-36c5-4b41-af77-07854a385dfc">
### 2048x3584x8192
- cublas + nccl: 301us
- decomp-based async-tp w/o cuda graph: 687us
- decomp-based async-tp w/ cuda graph: 356us
- new cutlass kernel: 276us
<img width="441" alt="image" src="https://github.com/user-attachments/assets/9e23ce21-863b-43dd-a562-fb05d3a5a144">
## Next Steps
- Add tuning logic
- Use `_fused_all_gather_matmul_native` as a backend for the decomp-based async-tp impl
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139227
Approved by: https://github.com/weifengpy, https://github.com/Chillee
This patch
1. Adds documentation to `PyCodegen.__call__`, `PyCodegen.tempvars` and
the `allow_cache` flag.
2. Merges a few existing code paths in `PyCodegen.__call__`.
3. removes the `elif var in cg.tempvars` code path in
`codegen_save_tempvars`, because it's no longer needed after #113725,
as we have up-to-date `VariableTracker.source` now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139670
Approved by: https://github.com/jansel
ghstack dependencies: #139538
This effectively undoes #115095, which is no longer needed after #113725.
Why did we need #115095? I went back in history and found that [this line](https://github.com/pytorch/pytorch/pull/113725/files#diff-0bb1756725c4426408938314b0c9d3988ae5bf49994892d7038ad7746e209e9fR86)
actually fixed what #115095 fixed. Specifically, without the
`allow_cache` check for the "dup_top" optimization, we could incorrectly
codegen based on source, despite `codegen_update_mutated` requested to
codegen from value, for updates to pre-existing lists, etc. Since #113725 added
the `allow_cache` check, we no longer need the `mutable_side_effects_from_source`
code path from #115095.
However, #115442 introduced a `value_from_source` flag which didn't
account for the `mutable_side_effects_from_source` branch. So this patch
adds an extra check to keep existing behavior for export, and leaves a
TODO for investigating what exactly export wants from codegen, when it
comes to side effects and sources.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139538
Approved by: https://github.com/jansel
Summary:
Move flight recorder logger class out from utils.py into its own file.
This makes the program more modular.
This is mostly a refactoring/non-functional change.
Test Plan:
Build fr_trace locally and ran it.
```
buck build //caffe2/fb/flight_recorder:fr_trace
Buck UI: https://www.internalfb.com/buck2/875ca6a3-e86e-4263-95a0-579502494c5c
Network: Up: 0B Down: 0B
Jobs completed: 6818. Time elapsed: 0.2s.
BUILD SUCCEEDED
```
Ran it as follows:
```
cd buck-out/v2/gen/fbcode/caffe2/fb/flight_recorder
./fr_trace.par -p trace_ /tmp
Not all ranks joining collective 3 at entry 2
group info: 0:default_pg
collective: nccl:all_reduce
missing ranks: {1}
input sizes: [[4, 5]]
output sizes: [[4, 5]]
expected ranks: 2
collective state: scheduled
collective stack trace:
<module> at /home/cpio/test/c.py:66
```
Differential Revision: D65503768
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139806
Approved by: https://github.com/fduwjj
Fixes #138550.
### Description
In the fusion of two nodes, the node with fewer variables (`node_to_recomp`) aligns its variable ranges with the other node (`ref_node`). In detail, `node_to_recomp` changes its variable ranges to the original ranges of `ref_node`. However, if both nodes have changed their ranges, i.e., the simplified variable ranges differ from the original ones, the issue comes up.
### Solution
For the case where the `ref_node` also changes its variable ranges, we recompute the size and body for it, to ensure the nodes are simplified to the same size.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138568
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
Fixes #111824
Currently, if the user specifies their group normalization to be in NHWC format, PyTorch will default to NCHW tensors and convert. This conversion is not obvious to the user unless they check the format themselves, which is not intuitive. This PR adds support for NHWC on CUDA by adding the necessary kernels.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126635
Approved by: https://github.com/eqy, https://github.com/mikaylagawarecki
Summary:
This diff reverts D65290089
This change is introducing more logging than I realized and could present problems for tlparse.
Test Plan: NA
Reviewed By: jamesjwu
Differential Revision: D65541060
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139893
Approved by: https://github.com/jamesjwu
Follow up to some issues @malfet's recent PR pointed out about missing ops #139763. Tried to mirror it to other important nearby ops. Seems like we could automate / autogen this more for generic pointwise ops like this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139890
Approved by: https://github.com/malfet
This allows Configs to handle setting their defaults (or overriding
themselves) via environment variables.
The environment variables are resolved at install time (which is usually
import time). This is done 1) to avoid any race conditions between
threads etc..., but 2) to help encourage people to just go modify the
configs directly, vs overriding environment variables to change
pytorch behaviour.
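The "resolved at install time" behaviour can be sketched as follows. This is a toy model, not the actual config module: `ToyConfig`, `install`, and the `TOY_DEBUG` variable are hypothetical names, and the environment is passed in as a plain mapping so the example has no side effects.

```python
import os


class ToyConfig:
    """Hypothetical config entry: a default plus an optional env var override."""

    def __init__(self, default, env_name=None):
        self.default = default
        self.env_name = env_name


def install(configs, environ=os.environ):
    """Resolve every config once; later environment changes are not seen."""
    resolved = {}
    for name, cfg in configs.items():
        if cfg.env_name is not None and cfg.env_name in environ:
            resolved[name] = environ[cfg.env_name]
        else:
            resolved[name] = cfg.default
    return resolved


env = {"TOY_DEBUG": "1"}
values = install({"debug": ToyConfig("0", env_name="TOY_DEBUG")}, environ=env)

env["TOY_DEBUG"] = "0"   # changing the environment after install has no effect
values["debug"] = "2"    # changing the config directly still works
```

Reading the environment once keeps the resolved value stable across threads, and it nudges users towards editing the config value itself.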
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138956
Approved by: https://github.com/ezyang
ghstack dependencies: #138766
Fixes #137512
Relaxes the restriction that the ragged dim is immediately next to the batch dim e.g. `(B, *, D_0, ..., D_N)`. This allows for constructing NJTs of shape e.g. `(B, D, j0)` directly. It's possible before this PR to get an NJT of e.g. shape `(B, D, j0)` by constructing an NJT of shape `(B, j0, D)` and transposing it. This PR allows a user to go straight there without the transpose. The standard `torch.nested.nested_tensor(list)` constructor has been updated to support this.
At the very least, this is useful for testing on transposed NJTs. I'm willing to make this functionality private if needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137125
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
Add new documentation to show one memory usage benefit brought by the TorchDynamo-based ONNX exporter.
Also add a unit test to make sure the TorchDynamo-based ONNX exporter works well under FakeTensorMode.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139388
Approved by: https://github.com/xadupre
Per our discussion in https://fburl.com/gdoc/voce5o06, we will run slow jobs more frequently on all trunk commits. Note that slowgradcheck jobs are moved to periodic as they are not about running slow tests.
There are currently 3 GPU + 2 ROCm + some CPU `linux.4xlarge` runners running slow jobs. So, I don't expect to see a big increase in CI cost after this.
Also, these slow jobs will only run in trunk commits, not in PRs, so their duration won't affect PR TTS.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139842
Approved by: https://github.com/clee2000
Summary: When we use aoti_compile_and_package to package the AOTI compiled artifacts, cubin files will be included, and at deploy time we should set the cubin file directory to the path that contains the unzipped cubin files.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139848
Approved by: https://github.com/aakhundov
As MacOS-15 or newer supports those out of the box. This significantly reduces memory requirements and improves performance for some stable diffusion networks.
Test plan: Run
```python
from diffusers import StableDiffusionXLPipeline, AutoencoderKL, EulerAncestralDiscreteScheduler
import torch
import time

vae = AutoencoderKL.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0",
                                    subfolder='vae',
                                    torch_dtype=torch.bfloat16,
                                    force_upcast=False).to('mps')
pipe = StableDiffusionXLPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", vae=vae,
                                                 torch_dtype=torch.bfloat16, variant="fp16").to('mps')
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)

start_time = time.time()
start_mps_mem = torch.mps.driver_allocated_memory()
image = pipe(prompt="Spherical cow in vacuum",
             num_inference_steps=10,
             guidance_scale=8,
             generator=torch.Generator("mps").manual_seed(42),
             ).images[0]
end_mps_mem = torch.mps.driver_allocated_memory()
run_time = time.time() - start_time
# Report the memory increase relative to the starting memory, not the start time.
print(f"run time in {run_time:.2f} sec, end_mps_mem {end_mps_mem/1024.0**2:.2f} Mb mem increase {(end_mps_mem-start_mps_mem)/1024.0**2:.2f} Mb")
image.save('bfloat16.png')
```
Before the change, total memory use was 16Gb and the run needed 65 sec to complete; after it, memory drops to 14Gb and the run takes 50 sec on an M2Pro, while the generated image remains the same:

Fixes https://github.com/pytorch/pytorch/issues/139389
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139791
Approved by: https://github.com/drisspg, https://github.com/Skylion007
ghstack dependencies: #139788, #139784, #139763
fsspec transactions do not support concurrency and assume that there is at most 1 running transaction per filesystem. This is *not* true in our usage, where because of multi-threading we usually have multiple concurrent transactions running at once.
Previously, this would just (unsafely) pass but lead to hard-to-debug race conditions (since the commit of one transaction will blow away the state of the other transaction). In fsspec 2024.3.0, trying to commit concurrent transactions will actually crash (see the code at 76ca4a6888/fsspec/transaction.py (L39) -- because each filesystem can have a single transaction, this tear-down logic will error).
Instead, let's manually handle committing / discarding changes to the file.
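One common way to commit or discard a file by hand is to write to a temporary file and rename it on success. The sketch below shows that generic pattern with the standard library only. It is not the code from this PR, and `transactional_write` is a hypothetical name.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def transactional_write(path):
    """Write to a temp file next to path; rename on success, delete on error."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            yield f
        os.replace(tmp_path, path)  # commit: atomically move into place
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)     # discard: drop the partial file
        raise
```

Each writer owns its own temporary file, so concurrent writers in different threads do not share any per-filesystem transaction state.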
I don't have a minimal test-case, but in Meta this solves a broken test on `fsspec >= 2024.3.0`:
Before: https://www.internalfb.com/intern/testinfra/testrun/7318349626774607
After: https://www.internalfb.com/intern/testinfra/testrun/2251800062722633
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135541
Approved by: https://github.com/Skylion007
**About the PR**
In the implementation of SmoothQuant in Torchao, quantized linear is computed by `_int_mm(a, b)` + `mul(b_scale)` + `mul(a_scale)` (+ optional `add` for bias) with `reshape` and `convert_dtype` in between.
This PR adds a pass to fuse the corresponding patterns:
- (no bias) `reshape -> _int_mm -> convert_element_type -> (expand -> mul) -> mul -> reshape`
- (with bias) `pattern_no_bias -> add -> reshape -> reshape`
The patterns are replaced by `onednn.qlinear_pointwise` and `onednn.qlinear_prepack`, the latter of which is evaluated and frozen during the freezing process of Inductor. The final graph contains `onednn.qlinear_pointwise` only with packed weight constants.
Note that `onednn.qlinear_pointwise` does not support per-channel quantization of activation, which is a limitation of oneDNN library, so in that case we set activation scale to 1 and bias to none and apply scales and add bias after `onednn.qlinear_pointwise`.
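The arithmetic that the fused pattern computes can be written out in plain Python. This is a sketch under assumptions: `int_mm` and `quantized_linear` are hypothetical helpers, `a_scale` is taken to be one scale per activation row and `b_scale` one scale per output column, and nested lists stand in for tensors.

```python
def int_mm(a, b):
    """Integer matrix multiply on nested lists (stand-in for `_int_mm`)."""
    inner = len(b)
    cols = len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


def quantized_linear(a, b, a_scale, b_scale, bias=None):
    """_int_mm(a, b) -> mul(b_scale) -> mul(a_scale) -> optional add(bias)."""
    acc = int_mm(a, b)
    out = [
        [acc[i][j] * b_scale[j] * a_scale[i] for j in range(len(acc[i]))]
        for i in range(len(acc))
    ]
    if bias is not None:
        out = [[value + bias[j] for j, value in enumerate(row)] for row in out]
    return out


a = [[1, 2], [3, 4]]   # quantized activation
b = [[1, 0], [0, 1]]   # quantized weight (identity, to keep the numbers simple)
result = quantized_linear(a, b, a_scale=[0.5, 0.25], b_scale=[2.0, 4.0])
print(result)  # [[1.0, 4.0], [1.5, 4.0]]
```

When the activation scale differs per row, as assumed here, it corresponds to the case where the scales and bias are applied after the integer matmul rather than inside it.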
**Validation results**
Accuracy/perplexity is not changed with or without this fusion pass.
Latency is improved by >10% with the fusion pass.
Test method:
- Model: EleutherAI/gpt-j-6b
- Hardware: Intel(R) Xeon(R) Platinum 8490H, running on 1 socket, 60 cores
- Using Intel OMP and Tcmalloc
- Running [the example script of SmoothQuant in Torchao](https://github.com/pytorch/ao/blob/main/torchao/prototype/smoothquant/example.py) with `TORCHINDUCTOR_FREEZING=1 numactl -N1 python example.py -m EleutherAI/gpt-j-6b --device=cpu --quant-mode=dynamic --compile`
**Test plan**
```
python test/inductor/test_mkldnn_pattern_matcher.py -k test_smooth_quant_with_int_mm
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139595
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jerryzh168
The only difference between `convert_boolean_attn_mask_cudnn` and `convert_boolean_attn_mask` is the value we initialize boolean tensor to
Reduce duplication by introducing `convert_boolean_attn_mask_`, which takes a `neg_inf` value, and make the above-mentioned implementations trivial one-line calls to it.
Also, as suggested by @Skylion007, replace `at::where(foo->logical_not, -inf, 0)` with `at::where(*foo, 0, -inf)`
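The equivalence behind the `where` rewrite is easy to check on a list of booleans. The sketch below uses hypothetical pure-Python helpers rather than ATen calls, and the finite `neg_inf` in the usage line is an arbitrary illustrative value.

```python
NEG_INF = float("-inf")


def mask_from_where(mask, neg_inf=NEG_INF):
    """where(mask, 0, neg_inf): kept positions become 0, masked ones neg_inf."""
    return [0.0 if keep else neg_inf for keep in mask]


def mask_from_logical_not(mask, neg_inf=NEG_INF):
    """where(logical_not(mask), neg_inf, 0): the older, equivalent formulation."""
    return [neg_inf if (not keep) else 0.0 for keep in mask]


mask = [True, False, True]
additive = mask_from_where(mask)
# A caller can pass a different fill value (illustrative finite value here).
finite = mask_from_where(mask, neg_inf=-1e9)
```

Both helpers give the same additive mask, and the only parameter that varies between callers is the fill value.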
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139784
Approved by: https://github.com/Skylion007, https://github.com/drisspg
ghstack dependencies: #139788
Summary:
Save around 8% on the torchrec model.
In most cases the new implications are not optimizations anyway; in some cases they are,
but optimizing them is useless.
ex:
```
generating implications for Eq(Mod(s0, 3), 0)
adding Eq(Mod(s0, 3), 0)
adding Eq(0, Mod(s0, 3))
adding Ne(Mod(s0, 3), 0)
adding Ne(0, Mod(s0, 3))
adding Mod(s0, 3) <= 0
adding 0 < Mod(s0, 3)
adding True
adding False
```
VS
```
generating implications for Eq(Mod(s0, 3), 0)
adding Eq(Mod(s0, 3), 0)
adding Eq(0, Mod(s0, 3))
adding Ne(Mod(s0, 3), 0)
adding Ne(0, Mod(s0, 3))
adding Mod(s0, 3) <= 0
adding 0 < Mod(s0, 3)
adding 0 <= Mod(s0, 3)
adding Mod(s0, 3) < 0
```
The main difference is that 0 <= Mod(s0, 3) can be simplified to True and Mod(s0, 3) < 0 to False, but with this change
this won't happen. But True: True and False: False are useless anyway lol, so it's ok I think.
```
buck2 run fbcode//mode/opt fbcode//torchrec/distributed/tests:pt2_compile_benchmark -- --num-features=1000
```
<img width="1082" alt="Screenshot 2024-11-04 at 9 25 51 PM" src="https://github.com/user-attachments/assets/a26e291b-9280-4b55-9275-f3201a36ac51">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139738
Approved by: https://github.com/ezyang
ghstack dependencies: #139703
Assuming the forward pass user code looks like:
```
for _ in range(2):
    x = layer(x)
```
and we have `fully_shard(layer)`, then:
- the forward pass will be like: "unshard layer -> call layer 1st time -> reshard layer -> unshard layer -> call layer 2nd time-> reshard layer" (currently same for both eager and compile)
- the backward pass will be like: "unshard layer -> call layer 1st time -> reshard layer -> unshard layer -> call layer 2nd time-> reshard layer" in eager, but currently it's "unshard layer -> call layer 1st time -> call layer 2nd time -> reshard layer" in compile
The behavior in the backward pass is different between eager and compile, which is not ideal.
I am currently trying to look for a way to fix this non-ideal behavior of compile - tried a few things:
1. Tracing the RegisterPostBackwardFunction custom autograd function - this still seems to be a no-go, due to HOP not supporting side-effects.
2. Instead of custom autograd function, do a "multi-grad hook" to wait for all gradients to be ready before triggering post_backward. However, this approach seems to have bad interaction with register_hook of pre_backward, in the sense that it's unclear which of them will be triggered first in practice.
3. Force execute any pending post_backward before unshard in pre_backward hook, and rely on compiler to move the reshard to the right place to optimize peak memory. -> This PR
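The idea in option 3 can be modelled with a toy event log. This is only one reading of the description: `ToyShardedLayer` and its methods are hypothetical, and the real hooks live inside FSDP and are more involved.

```python
class ToyShardedLayer:
    """Toy model: flush a pending post_backward before the next unshard."""

    def __init__(self):
        self.events = []
        self.post_backward_pending = False

    def pre_backward(self):
        if self.post_backward_pending:
            # Force the pending post_backward (reshard) before unsharding again.
            self.post_backward()
        self.events.append("unshard")

    def backward_compute(self, label):
        self.events.append(f"backward {label}")
        self.post_backward_pending = True

    def post_backward(self):
        self.events.append("reshard")
        self.post_backward_pending = False


layer = ToyShardedLayer()
for label in ("A", "B"):       # two backward invocations of the same layer
    layer.pre_backward()
    layer.backward_compute(label)
layer.post_backward()          # final post_backward at the end of backward
print(layer.events)
```

In this model every backward invocation is bracketed by its own unshard and reshard, which matches the eager ordering described above.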
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139671
Approved by: https://github.com/awgu
- Remove "mypy: allow-untyped-defs" and mark functions individually with "no-untyped-def"
- Mark some trivial functions with the proper return types (`None` and `torch.dtype`)
- Fixed a type bug in the signature of supported_dtype_of_cpp_wrapper()
- `ruff check torch/_inductor/ir.py --select ANN --fix --unsafe-fixes` and then fixed up things that looked incorrectly applied.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139238
Approved by: https://github.com/Skylion007, https://github.com/ezyang
**Summary**
In the case of LLaMA2, a linear operation with an activation of size `(4, 1, 4096)` and stride `(4096, 128, 1)` is decomposed into `matmul`, and the decomposition of `matmul` results in `bmm` because of a strict contiguity check. We can align the contiguity check with ATen by skipping dims of size 1, which enables decomposition into `mm` instead.
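The difference between the two checks can be seen on the shape from the summary. The functions below are a pure-Python sketch written for illustration (the names are hypothetical) and they are not the Inductor or ATen implementations.

```python
def is_contiguous_strict(sizes, strides):
    """Require the exact row-major stride for every dim, including size-1 dims."""
    expected = 1
    for size, stride in zip(reversed(sizes), reversed(strides)):
        if stride != expected:
            return False
        expected *= size
    return True


def is_contiguous_skip_size_one(sizes, strides):
    """Same check, but the stride of a size-1 dim is ignored."""
    expected = 1
    for size, stride in zip(reversed(sizes), reversed(strides)):
        if size == 1:
            continue  # a size-1 dim never gets stepped over, so its stride is irrelevant
        if stride != expected:
            return False
        expected *= size
    return True


sizes, strides = (4, 1, 4096), (4096, 128, 1)
strict = is_contiguous_strict(sizes, strides)        # False: stride 128 != 4096
relaxed = is_contiguous_skip_size_one(sizes, strides)  # True
```

The strict check rejects the tensor only because of the stride on the size-1 dim, even though the memory layout is dense.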
**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_mkldnn_pattern_matcher.py -k test_linear_input_non_contiguous_3D_wo_bias
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139172
Approved by: https://github.com/jgong5, https://github.com/ezyang
This is a bug on main exposed by https://github.com/pytorch/pytorch/issues/139476
We have dict tag optimization where if the dict tag does not change, we
skip guards on all the items of the dict that are "immutable". We
considered tensors as immutable in such scenarios. This is critical for
guard eval performance, because generally users don't change their
parameters.
If I try to remove this optimization, we see slowdowns, e.g., 3.03x to
2.95x on the conv_mixer TIMM benchmark.
So, I am adding a flag which keeps the current state but allows
users to remove this optimization. Not ideal, but given how serious guard eval perf has to be,
we are in the gray area of the unsoundness vs performance tradeoff.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139560
Approved by: https://github.com/jansel
This PR adds some instructions for how to add a TARGETS file to run the
fx_graph_runnable script. I'm planning to add some followups that will
add additional imports for custom ops and use autodeps to get the
dependencies, but I figure this PR is an easy first step.
Test Plan:
- pytest test/dynamo/test_structured_trace.py
- Does anyone have suggestions for how to test this?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139481
Approved by: https://github.com/eellison
This patch adds 2 simple methods `VariableTracker.is_mutable()` and
`VariableTracker.is_immutable()`, which helps clarify intention. For
instance, rather than writing
```python
if var.mutation_type:
    ...
```
After this patch one can write
```python
if var.is_mutable():
    ...
```
This patch also simplifies `mutation_type` propagation in some
`ListVariable` methods.
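A minimal sketch of what such helpers might look like, assuming that a `mutation_type` of `None` means "immutable". `ToyVariableTracker` is a hypothetical stand-in, not Dynamo's class.

```python
class ToyVariableTracker:
    """Hypothetical stand-in; assumes mutation_type is None for immutable values."""

    def __init__(self, mutation_type=None):
        self.mutation_type = mutation_type

    def is_mutable(self):
        return self.mutation_type is not None

    def is_immutable(self):
        return self.mutation_type is None


frozen = ToyVariableTracker()
tracked = ToyVariableTracker(mutation_type=object())  # any marker object
```

Call sites then state the intent directly instead of relying on the truthiness of `mutation_type`.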
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139341
Approved by: https://github.com/mlazos, https://github.com/anijain2305
ghstack dependencies: #139339, #139340
This patch addresses the renaming part of #133027, specifically, it
renames the following and adds documentation for relevant classes.
1. `VariableTracker.mutable_local` to `mutation_type`
2. `MutableLocal` to `ValueMutationNew`
3. `MutableSideEffects` to `ValueMutationExisting`
4. `MutableLocalSource` to `SourceType`
5. `MutableLocalSource.Local` to `New`
Note that (2), (3) and (5) are mainly to bring consistency between them
and `AttributeMutationNew`, `AttributeMutationExisting`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139339
Approved by: https://github.com/jansel, https://github.com/mlazos, https://github.com/anijain2305
### Motivation
Today, watchdog only reports that it found a collective timeout:
```
[rank1]:[E1104 14:02:18.767594328 ProcessGroupNCCL.cpp:688] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=200, NumelOut=200, Timeout(ms)=5000) ran for 5096 milliseconds before timing out.
```
While this is nice, it is hard to associate the error with user's program or library stack.
### This PR
This PR gives watchdog the ability to report the call-time stack of the collective, so that it would be easier to track the error back to the program's behavior.
The call-time stack was recorded by Flight Recorder with minimal overhead (for details, please read this [doc](https://dev-discuss.pytorch.org/t/fast-combined-c-python-torchscript-inductor-tracebacks/1158) written by @zdevito ). In `ProcessGroupNCCL`, we are only tracking / reporting the python part so that it fits most PyTorch users.
### Demo
[stack_demo.py](https://gist.github.com/kwen2501/6758e18d305d67fc6f3f926217825c09).
```
TORCH_NCCL_TRACE_BUFFER_SIZE=100 torchrun --nproc-per-node 2 stack_demo.py
```
`TORCH_NCCL_TRACE_BUFFER_SIZE` is for turning on the Flight Recorder.
Output:
```
[rank0]:[E1104 14:19:27.591610653 ProcessGroupNCCL.cpp:695] Stack trace of the timedout collective operation:
#0 all_reduce from /data/users/kw2501/pytorch/torch/distributed/distributed_c10d.py:2696
#1 wrapper from /data/users/kw2501/pytorch/torch/distributed/c10d_logger.py:83
#2 bar from /data/users/kw2501/sync_async/repro.py:15
#3 foo from /data/users/kw2501/sync_async/repro.py:24
#4 main from /data/users/kw2501/sync_async/repro.py:34
#5 <module> from /data/users/kw2501/sync_async/repro.py:40
[rank1]:[E1104 14:19:27.771430164 ProcessGroupNCCL.cpp:695] Stack trace of the timedout collective operation:
#0 all_gather_into_tensor from /data/users/kw2501/pytorch/torch/distributed/distributed_c10d.py:3630
#1 wrapper from /data/users/kw2501/pytorch/torch/distributed/c10d_logger.py:83
#2 baz from /data/users/kw2501/sync_async/repro.py:20
#3 foo from /data/users/kw2501/sync_async/repro.py:26
#4 main from /data/users/kw2501/sync_async/repro.py:34
#5 <module> from /data/users/kw2501/sync_async/repro.py:40
```
From the log above, we can tell that `bar()` and `baz()` are the places where the two ranks diverge.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139659
Approved by: https://github.com/wconstab, https://github.com/fduwjj
## This Stack
This stack does the following things to support `xformers`-style, comm-aware Triton kernels:
- Exposes `signal_pad`s as tensors in Python
- Adds a binding for `cuMemsetAsync`
These in combination aim to provide users with more flexibility to express custom signaling/synchronization patterns.
## This PR
Make `cuMemset32Async` available via `_SymmetricMemory.memset32`. We chose `cuMemset32Async` over `cudaMemsetAsync` because it allows for `uint32_t`-wise memset. This provides users with better flexibility.
To enable this, we also added the following cuda driver APIs in `c10::cuda::DriverAPI`:
- `cuDevicePrimaryCtxRetain` - for obtaining the primary context of a device in the form of `CUcontext`.
- `cuCtxGetCurrent`/`cuCtxSetCurrent` - for setting and restoring the context for cuda driver APIs such as `cuMemset32Async`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138755
Approved by: https://github.com/weifengpy, https://github.com/eqy, https://github.com/lw
This PR enables donated buffer in OSS and handles two edge cases:
1. While donated buffer relies on storage to check aliasing, sparse tensor subclasses do not provide access to storage. So we skip sparse tensor subclasses for donated buffer.
2. Handles missing "val" from n.meta. This is observed from `inductor/test_fused_attention.py::SDPAPatternRewriterCpuTests::test_sdpa_rewriter_11_cpu`,
`functorch/test_aotdispatch.py::TestAOTAutograd::test_input_mutation_simple_with_none_and_nontensor`, and
`inductor/test_compiled_autograd.py::TestCompiledAutograd::test_trace_run_with_rng_state`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139669
Approved by: https://github.com/bdhirsh
We don't need to do a loop over all the args, kwargs in the
ADInplaceOrView key; we just need to bump the version on the args,
kwargs that are mutable.
On the benchmark mentioned in
https://github.com/pytorch/pytorch/issues/139494
this made the time go from
```
mutate2 = 61.72943878173828
no_mutate2 = 36.89440155029297
mutate = 236.3092498779297
no_mutate = 59.31964874267578
```
to
```
mutate2 = 47.976478576660156
no_mutate2 = 38.37468719482422
mutate = 71.21315002441406
no_mutate = 59.7432975769043
```
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139513
Approved by: https://github.com/bdhirsh
ghstack dependencies: #139509
Summary:
Dedup the data-dependent errors based on the stacktrace they point to. Right now we just display every propagate-real-tensor log that shows up, but we can actually dedup them if they are due to the same piece of code (e.g. there could be multiple calls to a piece of code that does some data-dependent computation).
This occurred when trying out draft export on the PT2I model zoo. For a specific model, previously we would get ~3k data dependent errors, but after deduping based on the stacktrace we now only get 4 errors.
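Deduplicating by stack trace amounts to grouping errors on a hashable form of the stack. A minimal sketch, assuming each error is a `(message, stack)` pair where `stack` is a list of frame strings (both the helper name and the data layout are hypothetical):

```python
def dedup_by_stack(errors):
    """Keep the first error seen for each distinct stack trace."""
    unique = {}
    for message, stack in errors:
        # Lists are not hashable, so use a tuple of frames as the key.
        unique.setdefault(tuple(stack), message)
    return [(message, list(stack)) for stack, message in unique.items()]


errors = [
    ("data-dependent guard on u0", ["model.py:10", "layer.py:42"]),
    ("data-dependent guard on u1", ["model.py:10", "layer.py:42"]),
    ("data-dependent guard on u2", ["model.py:30"]),
]
deduped = dedup_by_stack(errors)
```

Here the first two errors come from the same code location, so only one of them is reported.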
Test Plan: CI
Differential Revision: D65374254
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139540
Approved by: https://github.com/pianpwk, https://github.com/zou3519
Summary:
When we bypassed the cache write in inductor, we also forgot to reset the bundle. This moves resetting the bundle into the post_compile step so it gets reset uniformly.
This diff also turns on the cache for internal so that we can do a code rollout.
Test Plan: updated tests
Differential Revision: D65457224
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139698
Approved by: https://github.com/ezyang
Summary: During dynamic rendezvous, we shouldn't use the address from the store but just use `self._this_node.addr` directly because sometimes, the store host is not the host of rank0. Passing wrong host will cause timeout error. This is a follow up fix to S463164, for internal tests, we disable the TCPStore sharing for now.
Test Plan: CI.
Differential Revision: D65453312
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139702
Approved by: https://github.com/XilunWu
Disable tree vectorize in vec_convert.h for gcc10 and aarch64+sve, where it causes an internal compiler error.
```
/tmp/tmpuqk7lj9j/zx/czx2eyturb6j6m727xhvknkjbdu3y5nqqk66wgxcjkwnxuzvpm5r.cpp:3:18: internal compiler error: in vect_get_vector_types_for_stmt, at tree-vect-stmts.c:12252
3 | extern "C" void kernel(const float* in_ptr0,
```
Fixes #137775
I've not linked a gcc bug report yet as they require a minimal reproducer to be made.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137795
Approved by: https://github.com/malfet
Summary:
Currently, we incorrectly log process_group for DCP based events.
We rely on [c10d_logger.py](https://fburl.com/v4mdme9z) to fill in information about process_group (e.g. backend, nccl_version if available).
In [checkpoint/logger.py](https://fburl.com/yho9nqbu) we pass the `msg_dict` to c10d_logger which never contains the `process_group` param, so [c10d_logger](https://fburl.com/zlw2ukxp) logs information about the default process_group which is always `NCCL`.
Test Plan:
Before:
Always defaults to NCCL even though GLOO is passed by caller.
{F1950847585}
After:
GLOO backend shows up.
{F1950848375}
Differential Revision: D65255871
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139428
Approved by: https://github.com/teja-rao, https://github.com/mhorowitz
As an outcome of https://fburl.com/gdoc/voce5o06, I want to assign owner(s) to any periodic or slow jobs that are still needed but can't run more frequently (too $$$, capacity constraints, don't fail that often). They include:
* multigpu
* debug build
* ROCm (distributed, slow)
@malfet @soulitzer I put down your names as the owners of debug build and slowgradcheck respectively. Please let me know if you are ok with that, or if you have a better option in mind.
Any jobs there without an owner are owned by us (PT Dev Infra)
### Testing
The owners show up in the job name https://hud.pytorch.org/pr/139519
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139519
Approved by: https://github.com/malfet
Summary:
I think we can inplace a buffer if all of the users of said buffer are "inconsequential", defined as having been removed, being completed, or being part of the ancestors set. In particular, this allows LayerNorm to inplace its input buffer.
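The "inconsequential users" condition can be written as a small predicate. This is a sketch of the rule as stated in the summary, with hypothetical names and plain sets standing in for the scheduler's bookkeeping.

```python
def can_inplace(users, removed, completed, ancestors):
    """True if every user of the buffer is removed, completed, or an ancestor."""
    return all(
        user in removed or user in completed or user in ancestors
        for user in users
    )


# Hypothetical node names, for illustration only.
ok = can_inplace(
    users={"mm", "mean", "var"},
    removed={"mean"},
    completed={"mm"},
    ancestors={"var"},
)
blocked = can_inplace(
    users={"mm", "later_reader"},
    removed=set(),
    completed={"mm"},
    ancestors=set(),
)
```

A single user outside the three sets is enough to block the in-place reuse.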
Implements:
https://github.com/pytorch/pytorch/issues/132826
Test Plan:
New unit test of matmul followed by LayerNorm, make sure there's an inplaced buffer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138383
Approved by: https://github.com/eellison
Summary:
I realized I wanted to check "are my cache entries/IO unreasonably large"
and there's no easy way to do it. This lets me do it.
Test Plan: servicelab
Differential Revision: D65390363
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139627
Approved by: https://github.com/c00w
Summary:
While testing exportability for PT2 Inference models, we found various cases of invalid op inputs during tracing, for example errors like: `a and b must have same reduction dim`, `expected scalar type Long but found Int`, etc. Looking more closely, these happened due to the same few meta kernels & eager kernels producing mismatched outputs upstream (e.g. different output tensor dtype, int output).
Adding checks to catch mismatched outputs in real tensor prop upstream, so errors are raised at the mismatched op, instead of the downstream ops taking them as inputs. Relies a lot on utils from [CrossRefFakeMode](929797dedb/torch/_subclasses/fake_utils.py (L78))
Follow ups: could add more checks, and maybe have a flag to only enable these for cases like draft mode, so perf doesn't suffer?
Test Plan: test_export, test_fake_tensor
Differential Revision: D64210055
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137747
Approved by: https://github.com/zou3519
# Why?
I want the following code to work.
minimal repro:
```
import torch

class M(torch.nn.Module):
    def forward(self, dilate_flag):
        return dilate_flag.item()

input1 = (torch.tensor([1], dtype=torch.bool, device="cuda"),)
model = M().cuda()
ep = torch.export.export(model, input1, strict=True)
path = torch._inductor.aot_compile(ep.module(), input1)
aot_model = torch._export.aot_load(path, device="cuda")
actual_output = aot_model(*input1)
```
error: AssertionError: Encountered an unsupported object of type <class 'torch.SymBool'> while writing the metadata for exported program
second error will be handled by https://github.com/pytorch/pytorch/pull/138760
# Motivation
I could technically bypass it with a torch.int tensor. However, it doesn't work with torch.cond. I want the following to work. It would also require https://github.com/pytorch/pytorch/pull/138760 for aot compile to work.
```
import torch

class M(torch.nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.dilate_flag = 0

    def forward(self, dilate_flag):
        self.dilate_flag = dilate_flag.item()

        def true_fn(dilate_flag):
            return dilate_flag.clone()

        def false_fn(dilate_flag):
            return dilate_flag.clone()

        torch.cond(
            self.dilate_flag,
            true_fn,
            false_fn,
            (dilate_flag,),
        )
        return self.dilate_flag

input1 = (torch.tensor([1], dtype=torch.bool, device="cuda"),)
input2 = (torch.tensor([0], dtype=torch.bool, device="cuda"),)
inputs = (input1, input2)
model = M().cuda()
for input in inputs:
    expected_output = model(*input)
    ep = torch.export.export(model, input, strict=False)
    path = torch._inductor.aot_compile(ep.module(), input)
    aot_model = torch._export.aot_load(path, device="cuda")
    actual_output = aot_model(*input)
    assert (
        expected_output == actual_output
    ), f"henry they are not equal {expected_output} != {actual_output}"
```
Differential Revision: D64867504
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138765
Approved by: https://github.com/ydwu4
This refactoring is for getting a deterministic ordering of binding tensors and sizes of tensors. When seeing a free tensor x with shape (s0,) in subgraph, the ordering of lifting changes from
```
lift_x_in_child, lift_s0_in_child, lift_s0_in_parent, lift_x_in_parent
```
to
```
lift_x_in_parent, lift_s0_in_parent, lift_x_in_child, lift_s0_in_child
```
This produces a deterministic ordering of handling the symints in lifted tensors.
This is also the current contract of dynamo top-level graph: we lift free_symbols in sizes after tensor x and insert the free symbols before the tensor x's proxy.
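The new ordering can be pictured as a recursion that binds in the parent before the child. The sketch below only reproduces the ordering listed above; `Tracer` and `lift` are hypothetical stand-ins, not Dynamo's tracer classes.

```python
class Tracer:
    """Hypothetical stand-in for a (sub)graph tracer with an optional parent."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


def lift(tracer, tensor_name, size_symbols, log):
    """Record lifting events: parents first, and the tensor before its size symbols."""
    if tracer.parent is not None:
        lift(tracer.parent, tensor_name, size_symbols, log)
    log.append(f"lift_{tensor_name}_in_{tracer.name}")
    for sym in size_symbols:
        log.append(f"lift_{sym}_in_{tracer.name}")


parent = Tracer("parent")
child = Tracer("child", parent=parent)
log = []
lift(child, "x", ["s0"], log)
print(", ".join(log))
```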
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138559
Approved by: https://github.com/zou3519
ghstack dependencies: #138345, #138428, #138558, #138737
Code refactoring only. We move the wrap_to_fake_tensor_logic out of wrap_fx_proxy for placeholders to provide the invariant that **all graph inputs must set their example values when creating the inputs**. This invariant helps us to identify all the free symbols in the graph in top-level and sub-graphs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138428
Approved by: https://github.com/ezyang, https://github.com/zou3519
ghstack dependencies: #138345
Enabling Manywheel builds here: https://github.com/pytorch/pytorch/pull/138732
During the build I observe the failure with cuda jobs:
```
-- Compiler does not support SVE extension. Will not build perfkernels.
-- Found CUDA: /usr/local/cuda (found version "11.8")
-- The CUDA compiler identification is unknown
CMake Error at cmake/public/cuda.cmake:47 (enable_language):
No CMAKE_CUDA_COMPILER could be found.
Tell CMake where to find the compiler by setting either the environment
variable "CUDACXX" or the CMake cache entry CMAKE_CUDA_COMPILER to the full
path to the compiler, or to the compiler name if it is in the PATH.
Call Stack (most recent call first):
cmake/Dependencies.cmake:44 (include)
CMakeLists.txt:851 (include)
```
While the correct sequence is supposed to be:
```
-- Found CUDA: /usr/local/cuda (found version "11.8")
-- The CUDA compiler identification is NVIDIA 11.8.89
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Found CUDAToolkit: /usr/local/cuda/include (found version "11.8.89")
```
The issue turned out to be a missing PATH setting in the 2_28 Dockerfile. This section exists in the CentOS Dockerfile here:
https://github.com/pytorch/pytorch/blob/main/.ci/docker/manywheel/Dockerfile#L174-L175
(Please Note these Docker images are not used yet. The https://github.com/pytorch/pytorch/pull/138732 should enable using these images)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139631
Approved by: https://github.com/malfet, https://github.com/huydhn
Partially fixing https://github.com/pytorch/pytorch/issues/138685
Add a (relatively safe?) heuristic to skip fusion if it can potentially increase peak memory.
The doc string mainly explains what this PR is doing:
```
The implementation is more like a heuristic since we don't really know if we are at peak
or not when trying to fuse these two nodes. The order of nodes may change later which makes the
peak memory estimation hard.
Here is how we decide the LOWER BOUND of extra memory allocation if we fuse these 2 nodes:
1. find all buffers read by each node with a single user. These buffers are supposed to
be reused if we don't fuse these 2 nodes
2. find the intersection of these buffers for the two nodes and sum the total buffer size.
If we don't fuse these two nodes, we can at least avoid this much memory allocation.
Note that the extra memory allocation is not necessarily causing peak memory increase.
This is just a heuristic.
We return true only if the saving for fusion can not trade off the extra memory allocation.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138756
Approved by: https://github.com/jansel
ghstack dependencies: #139136
Since any stage can run a mixture of full backwards and split backwards,
it is important to count the sum of (full_backwards + backward_weight)
when comparing to num microbatches to determine last backward.
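A minimal sketch of that check, using hypothetical names (`is_last_backward` and its counter arguments are not the scheduler's real fields). It assumes that each microbatch contributes exactly one of these two events per stage, either a full backward or the weight-gradient step of a split backward.

```python
def is_last_backward(num_full_backward, num_backward_weight, num_microbatches):
    """Return True once backward has finished for every microbatch.

    Counting only full backwards (or only weight backwards) would never
    reach num_microbatches on a stage that mixes the two, so sum both.
    """
    return num_full_backward + num_backward_weight == num_microbatches


# A stage that ran 3 full backwards and 1 split backward, with 4 microbatches.
print(is_last_backward(3, 1, 4))
# Counting full backwards alone would miss the last one.
print(is_last_backward(3, 0, 4))
```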
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139415
Approved by: https://github.com/H-Huang
This addresses
https://github.com/pytorch/pytorch/pull/137677/files#r1799836499, which
had to set `allow_cache=False` for codegen on `DataPtrVariable.base`,
which is a `TensorVariable`, otherwise we observe failure of
`test_no_grad_copy` when testing with Dynamo.
I've seen `test_no_grad_copy` failing a few times, and every single time
it's related to cyclic reference, my best guess is the cyclic reference
holds some tensor object longer in memory than necessary, preventing the
optimization introduced in #11165.
This patch makes `OutputGraph.cleanup()` more aggressive by clearing out
all fields that might reference a `VariableTracker`. As a result, we can
remove the aforementioned `allow_cache=False`, which helps generate
better code (e.g., in the case of `test_no_grad_copy`, it skipped generating
a redundant graph whose only op is returning the input tensor; instead we just
generate a single `LOAD_FAST`).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139487
Approved by: https://github.com/jansel, https://github.com/aakhundov
These are not artificial patterns I came up with. They show up in the linear+CrossEntropyLoss graph.
Consider this snippet:
```
from torch import nn

# B, T, C, V are size constants assumed to be defined elsewhere.
class LinearAndCEL(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(C, V)
        self.ce = nn.CrossEntropyLoss()

    def forward(self, x, y):
        return self.ce(self.linear(x).view(B * T, V), y.view(-1))
```
`x` passed to `forward` is a 3D tensor of shape [B, T, C].
`self.linear` will first view x as a tensor of shape [B x T, C], do the matmul to produce a [B x T, V] tensor, and then view this output back to a 3D tensor of shape [B, T, V]. User code then adds another view op to convert the tensor shape to [B x T, V]. This generates a pair of redundant views. A pair of redundant permutes appears in the backward part when we compute gradients.
The view ops make it hard to chunk linear+CEL. When the view op breaks up the dimension being chunked, what should the chunker do (even if we merge those dimensions again later)? Removing these pointless view pairs makes the chunker simpler. And I think it's in general nice to do.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139136
Approved by: https://github.com/Chillee, https://github.com/jansel
This is tested in PR stacked above in
```python
python test/distributed/fsdp/test_fsdp_state_dict.py TestFSDPStateDict.test_torch_save_load
```
We cannot rely on `hasattr(..., __slots__)` to know whether a BUILD instruction has slotstate. For example, if a class subclasses ABC, `hasattr(__slots__)` will be `True` but there might be no slots (and hence `state` will not be a tuple). So revert #138936 to follow the pickle library's code
```python
>>> from abc import ABC
>>> hasattr(ABC, "__slots__")
True
```
So
```python
import torch
from abc import ABC
from dataclasses import dataclass

class Foo(ABC):
    pass

class FooWrapper(Foo):
    def __init__(self, x, y):
        self.x = x
        self.y = y

f = FooWrapper(1, 2)
torch.save(f, "temp.pt")
with torch.serialization.safe_globals([FooWrapper]):
    torch.load("temp.pt")
```
Would fail on the previous code with
```
File "/data/users/mg1998/pytorch/torch/serialization.py", line 1934, in _load
result = unpickler.load()
File "/data/users/mg1998/pytorch/torch/_weights_only_unpickler.py", line 366, in load
for k, v in slotstate.items():
```
As there is actually no slotstate
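For reference, the standard library's pure-Python unpickler decides this from the shape of the BUILD argument rather than from `__slots__`: only a 2-tuple is split into state and slotstate. The sketch below mirrors that rule without torch; `split_build_state` is a hypothetical helper name used for illustration.

```python
from abc import ABC


def split_build_state(state):
    """Split a BUILD argument into (state, slotstate) the way pickle.py does.

    Only a 2-tuple is treated as (dict_state, slot_state); anything else is
    plain state with no slotstate.
    """
    slotstate = None
    if isinstance(state, tuple) and len(state) == 2:
        state, slotstate = state
    return state, slotstate


class Foo(ABC):
    pass


class FooWrapper(Foo):
    def __init__(self, x, y):
        self.x = x
        self.y = y


f = FooWrapper(1, 2)
# hasattr() is True because ABC declares an empty __slots__ ...
print(hasattr(f, "__slots__"))
# ... yet the state that gets pickled is a plain dict, so slotstate is None.
state = f.__reduce_ex__(2)[2]
print(split_build_state(state))
```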
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139541
Approved by: https://github.com/malfet
ghstack dependencies: #138936, #139221, #139433
In this diff, I make the test_torchbind.py tests handle training IR. Today in the training IR, we don't see the effect token and HOP because this happens at the FunctionalTensorMode. Maybe in the future, we should move this logic up to the training IR so that writing passes etc. on training IR is safer. But for migration purposes, I think it is OK for now. I also fixed the following issues:
1. ep.module() doesn't register all aliased constants in the module.
2. When we retrace, we need to fakify the original Torchbind object.
3. We don't run any DCE on training IR so we need to add some more torch ops to verifier.
Differential Revision: [D64853530](https://our.internmc.facebook.com/intern/diff/D64853530)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138658
Approved by: https://github.com/ydwu4, https://github.com/zhxchen17
Adds a few more dynamo_timed() to measure triton compilation and load_by_key_path times.
In the case of async compilation with multiple threads, we'll generate a single `kernel_compile` event that occurs when waiting on all the parallel compiles to finish.
In the case where async parallel compilation is disabled (or, compile threads are warming up), we'll generate a `triton_compile` event for each kernel.
The `triton_compile` events are a bit questionable: do we need a row for each triton compile event? It might eat into our already low retention, so I might just remove that. Will discuss with @slarsen.
Differential Revision: [D65215707](https://our.internmc.facebook.com/intern/diff/D65215707/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139402
Approved by: https://github.com/oulgen
This is a bug on the main exposed by https://github.com/pytorch/pytorch/issues/139476
We have a dict tag optimization where if the dict tag does not change, we
skip guards on all the items of the dict that are "immutable". We
considered tensors as immutable in such scenarios. This is critical for
guard eval performance, because generally users don't change their
parameters.
If I try to remove this optimization, we see slowdowns, e.g., 3.03x to
2.95x on the conv_mixer TIMM benchmark.
So, I am adding a flag which keeps the current state but allows
users to remove this optimization. Not ideal, but given how serious guard eval perf has to be,
we are in the gray area of the unsoundness vs. performance tradeoff.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139560
Approved by: https://github.com/jansel
## This Stack
This stack does the following things to support `xformers`-style, comm-aware Triton kernels:
- Exposes `signal_pad`s as tensors in Python
- Adds a binding for `cuMemsetAsync`
In combination, these aim to provide users with more flexibility to express custom signaling/synchronization patterns.
## This PR
Make `cuMemset32Async` available via `_SymmetricMemory.memset32`. We chose `cuMemset32Async` over `cudaMemsetAsync` because it allows for `uint32_t`-wise memset. This provides users with better flexibility.
To enable this, we also added the following cuda driver APIs in `c10::cuda::DriverAPI`:
- `cuDevicePrimaryCtxRetain` - for obtaining the primary context of a device in the form of `CUcontext`.
- `cuCtxGetCurrent`/`cuCtxSetCurrent` - for setting and restoring the context for cuda driver APIs such as `cuMemset32Async`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138755
Approved by: https://github.com/weifengpy, https://github.com/eqy, https://github.com/lw
Previously: https://github.com/pytorch/pytorch/pull/138052, but the implementation was redone from scratch, so I opened a new PR.
This implements the ability to save and load profiles of automatic dynamic decisions, so on subsequent runs we can directly make something automatically dynamic. Unlike the previous implementation, this cache is never enabled by default; instead, you have to specify a "job id" that says it's OK to share results. We will be able to automatically populate this id for internal MAST jobs but for generic OSS users you will have to explicitly opt into it.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139001
Approved by: https://github.com/oulgen
This is the next step in supporting dynamic float arguments in PT2: docs.google.com/document/d/1HswUSp9H6mg8Vg27mhRk8YzC9q_uf63b6wz-gwx65BQ/edit?pli=1#heading=h.xvyiqp8tuje6. To make this more incremental and tractable, we've decided to opt the export path out of this first phase of the rollout.
Fixes python test/export/test_export.py TestExport.test_export_input_mutation_dynamic_shape when specialize_float=False
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139486
Approved by: https://github.com/ezyang
ghstack dependencies: #139451, #139482, #139484
In https://github.com/pytorch/pytorch/pull/134685, I transformed the following code:
```CPP
if (CUDAAllocatorConfig::release_lock_on_cudamalloc()) {
  // At scope exit, acquire the lock again. This provides safety against
  // any potential exceptions in the cudaMallocMaybeCapturing function.
  auto sg = c10::make_scope_exit([&]() { lock.lock(); });
  lock.unlock();
  p.err = cudaMallocMaybeCapturing(&ptr, size);
} else {
  p.err = cudaMallocMaybeCapturing(&ptr, size);
}
if (CUDAAllocatorConfig::release_lock_on_cudamalloc()) {
  TORCH_CHECK(
      lock.owns_lock(), "Failed to acquire lock after cudaMalloc");
}
```
into:
```CPP
if (CUDAAllocatorConfig::release_lock_on_cudamalloc()) {
  // At scope exit, acquire the lock again. This provides safety against
  // any potential exceptions in the cudaMallocMaybeCapturing function.
  auto sg = c10::make_scope_exit([&]() { lock.lock(); });
  lock.unlock();
}
auto active_pool = MemPoolContext::getActiveMemPool();
if (active_pool && active_pool->allocator() &&
    p.pool->owner_PrivatePool) {
  ptr = active_pool->allocator()->raw_alloc(size);
  p.err = ptr ? cudaSuccess : cudaErrorMemoryAllocation;
} else {
  p.err = cudaMallocMaybeCapturing(&ptr, size);
}
if (CUDAAllocatorConfig::release_lock_on_cudamalloc()) {
  TORCH_CHECK(
      lock.owns_lock(), "Failed to acquire lock after cudaMalloc");
}
```
This is wrong because I didn't realize what `c10::make_scope_exit([&]() { lock.lock(); });` does: it re-acquires the lock when the enclosing scope ends. As a result, my change doesn't let `release_lock_on_cudamalloc` do unlock..execute alloc..lock; instead it just does unlock..lock before the allocation runs. This PR rectifies that change and, in addition, adds an ASSERT ensuring the active pool and p.pool are the same (mirroring the behavior from released_cached_blocks).
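The scoping mistake is easy to reproduce outside of C++. The following is a Python analogue only, not the allocator code: `temporarily_released`, `alloc_buggy`, and `alloc_fixed` are hypothetical names, and a context manager plays the role of the scope-exit guard.

```python
import threading
from contextlib import contextmanager


@contextmanager
def temporarily_released(lock):
    """Release a held lock for the duration of the block, then re-acquire it."""
    lock.release()
    try:
        yield
    finally:
        lock.acquire()


def alloc_buggy(lock, alloc):
    with temporarily_released(lock):
        pass  # the guarded scope ends here, so the lock is re-acquired at once
    return alloc()  # runs with the lock held


def alloc_fixed(lock, alloc):
    with temporarily_released(lock):
        return alloc()  # runs while the lock is released


lock = threading.Lock()
lock.acquire()
# The fake allocation reports whether the lock was held while it ran.
print(alloc_buggy(lock, lock.locked))
print(alloc_fixed(lock, lock.locked))
```

In both variants the lock is held again once the function returns; the difference is only whether the allocation itself ran inside or outside the released window.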
Thanks @zvon82 for reporting this!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139430
Approved by: https://github.com/ezyang
This PR ensures that the `nanmean()` function raises a `RuntimeError` when using `int64` or `bool` dtypes, even for empty tensors. Previously, non-empty tensors correctly raised errors for unsupported dtypes, while empty tensors did not. This change brings consistent error handling for both cases, addressing the need raised in an issue by @hyperkai (Issue [#131043](https://github.com/pytorch/pytorch/issues/131043)).
### Changes
- Added checks in `nanmean_out()` to raise errors for `int64` and `bool` dtypes regardless of tensor size.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138745
Approved by: https://github.com/ezyang
Fixes issue with timm models where
example_value = 0.09999
proxy.node.target = <built-in function sub>
would fall through to
```
unimplemented(
"torch.* op returned non-Tensor "
+ f"{typestr(example_value)} {proxy.node.op} {proxy.node.target}",
case_name="unsupported_operator",
)
```
and graph break
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139482
Approved by: https://github.com/ezyang
ghstack dependencies: #139451
Previously: https://github.com/pytorch/pytorch/pull/138052, but the implementation was redone from scratch, so I opened a new PR.
This implements the ability to save and load profiles of automatic dynamic decisions, so on subsequent runs we can directly make something automatically dynamic. Unlike the previous implementation, this cache is never enabled by default; instead, you have to specify a "job id" that says it's OK to share results. We will be able to automatically populate this id for internal MAST jobs but for generic OSS users you will have to explicitly opt into it.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Differential Revision: [D65065497](https://our.internmc.facebook.com/intern/diff/D65065497)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139001
Approved by: https://github.com/oulgen
Summary:
Restarting (aborting and re-initializing) a PG is a basic need if we want
to achieve in-process restart of PGs without tearing down the whole
process.
Add this test to verify that this is supported by current NCCL.
Note that this restart test passes steadily only in blocking mode for now.
In nonblocking mode, there is a problem in either NCCL init or abort that
needs further investigation.
Test Plan:
new UT
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139496
Approved by: https://github.com/c-p-i-o, https://github.com/kwen2501
2024-11-01 22:13:37 +00:00
1655 changed files with 59611 additions and 31988 deletions
Scripts for building aarch64 PyTorch PIP Wheels. These scripts build the following wheels:
* torch
* torchvision
* torchaudio
* torchtext
* torchdata
## Aarch64_ci_build.sh
This script is designed to support CD operations within the PyPI manylinux aarch64 container, and to be executed in the container. It prepares the container and then executes __aarch64_wheel_ci_build.py__ to build the wheels. The script "assumes" the PyTorch repo is located at: ```/pytorch``` and will put the wheels into ```/artifacts```.
This app allows a person to build using AWS EC2 resources and requires AWS-CLI and Boto3 with AWS credentials to support building EC2 instances for the wheel builds. It can be used in a CodeBuild CD or from a local system.
If you do not have upload permissions, please ping @seemethere or @soumith to gain access
## New versions
New CUDA versions can be added by creating a new make target with the next desired version. For CUDA version NN.n, the target should be named `magma-cudaNNn`.
Make sure to edit the appropriate environment variables (e.g., DESIRED_CUDA, CUDA_ARCH_LIST) in the `Makefile` accordingly. Remember also to check `build_magma.sh` to ensure the logic for copying over the files remains correct.
New patches can be added by editing `Makefile` and `build_magma.sh` the same way `getrf_nbparam.patch` is implemented.
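As a quick illustration of the target naming rule for new CUDA versions (version NN.n maps to `magma-cudaNNn`), here is a small sketch. `magma_target_name` is a hypothetical helper and is not part of the `Makefile` or `build_magma.sh`.

```python
def magma_target_name(cuda_version):
    """Map a CUDA version string such as "11.8" to its make target name."""
    # Keep only major and minor so a patch component like "11.8.89" is ignored.
    major, minor = cuda_version.split(".")[:2]
    return f"magma-cuda{major}{minor}"


print(magma_target_name("11.8"))
```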
f"Invalid experiment name: {experiment_name}. Experiment names should only contain alphanumeric characters, '_', and '-'. They cannot contain spaces, and the special characters '_' and '-' cannot be the first or last characters."
Some files were not shown because too many files have changed in this diff