Fix for https://github.com/pytorch/pytorch/issues/122871. There are two cases where we emit pointwise cat:
- fusing into a pointwise use
- horizontally fusing copy_ kernels
The regression I looked into previously was due to being overly aggressive in the latter case. I've updated the logic there so that we only emit the horizontal fusion in the case that we would have to emit separate copy_ kernels anyway.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125772
Approved by: https://github.com/Chillee
This PR is part of a series of PRs to significantly speed up torch.onnx.export for models with many nodes (e.g. LLM). See #121422 for more analysis.
- As part of torch.onnx.export, a reverse look-up is made in env. This is done for each node, and this look-up costs time proportional to the graph size, which incurs an overall O(N^2) time complexity.
- A pragmatic solution is simply to keep a separate data structure that makes this de facto constant time. So, this introduces a set containing all the values of env (see the sketch after this list). Open to other ideas. Ideally `exist_in_env` wouldn't be needed at all, but I'm not sure how that can be done while preserving current behavior exactly.
- Resolves (4) in #121422.
- This code change and the choice of py::set looks a bit more natural on top of #123063, where the env is changed from a std::unordered_map to a py::dict.
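A minimal sketch of the idea (class and method names are illustrative, not the actual exporter code): keep a set of `env`'s values in sync with the dict so the reverse look-up is O(1).
```
class Env:
    """Toy stand-in for the exporter's env mapping (node -> exported value)."""

    def __init__(self):
        self._env = {}          # node -> value
        self._values = set()    # every value currently stored in _env

    def __setitem__(self, node, value):
        self._env[node] = value
        self._values.add(value)

    def exist_in_env(self, value):
        # previously an O(N) scan over self._env.values(); now O(1)
        return value in self._values
```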
Partially fixes #121422
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124909
Approved by: https://github.com/srikris-sridhar, https://github.com/justinchuby
When dispatching a fake tensor op we cache the result with `(op, args)` as the key. There are some args (such as one with a dynamic output shape) where the output can't be cached. Instead of validating the args every time we compute the cache key, only validate the args when we first see a new cache key.
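A minimal sketch of that caching policy (helper names are illustrative, not the actual FakeTensor cache code):
```
_cache = {}
_BYPASS = object()  # sentinel meaning "this key can never be cached"

def dispatch_with_cache(op, args, can_cache, run_op):
    key = (op, args)  # args assumed hashable here for illustration
    if key not in _cache:
        # validation runs only the first time we see this key
        _cache[key] = run_op(op, args) if can_cache(op, args) else _BYPASS
    cached = _cache[key]
    return run_op(op, args) if cached is _BYPASS else cached
```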
18.3% FakeTensor perf win on the microbenchmark (21.7% cumulative)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124223
Approved by: https://github.com/oulgen, https://github.com/masnesral
ghstack dependencies: #122911
Summary: This change introduces a new flag to perform a "warm start" test from the benchmark harness. The idea is to test a model twice: first with a fresh inductor cache (i.e., a "cold start"), and then a second run in a fresh process with the cache available (i.e., a "warm start"). We can later add this mode to CI runs to collect compile times for warm start.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125353
Approved by: https://github.com/eellison, https://github.com/desertfire
Move the GHA artifact download to before any XML parsing is done for upload-test-stats.
Do not download GHA artifacts during XML parsing, since they were already uploaded to S3 in the step above and will be downloaded when all the artifacts are downloaded from S3.
The previous method resulted in duplicates if you ran the script again.
TODO: write a deduper so we don't have to worry about this at all.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125609
Approved by: https://github.com/huydhn
Summary: This fix does three things:
1. When we add inputs from the partitioner to the top-level graph module, we insert them in the partitioner's order, which is not guaranteed to be the same as the original graph inputs. This PR fixes that.
2. When we replace autograd ops with a HOP, we create new submodules and access their outputs via getitem calls. As a result, previous node names associated with getitem get updated, making the graph different from the produced graph signature. So I just update the graph signature accordingly.
3. We run the runtime_assertion pass before the autograd HOP pass because otherwise the constraints won't be populated correctly.
Differential Revision: [D57130314](https://our.internmc.facebook.com/intern/diff/D57130314)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125793
Approved by: https://github.com/zhxchen17
Fixes #69031, #42793
This PR fixes the bug introduced in #54981 where parameters used within a `no_sync` scope are not respected when `find_unused_parameters` is set to `True`. The `local_used_map_` and `numGradHooksTriggeredMap_` variables should be updated regardless of the `no_sync` state.
Tested and verified with fairseq2 and wav2vec2 ASR finetuning recipe. All gradients are correctly synced across workers as expected after applying this fix.
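A usage sketch of the scenario this fixes (assumes an initialized process group): gradients produced inside `no_sync()` accumulate locally, and with this change parameters touched only inside that scope are still tracked correctly when `find_unused_parameters=True`.
```
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def train_micro_batches(ddp_model: DDP, micro_batches, loss_fn, optimizer):
    # accumulate over all but the last micro-batch without gradient sync
    with ddp_model.no_sync():
        for x, y in micro_batches[:-1]:
            loss_fn(ddp_model(x), y).backward()
    x, y = micro_batches[-1]
    loss_fn(ddp_model(x), y).backward()  # gradient sync happens here
    optimizer.step()
    optimizer.zero_grad()
```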
Co-authored-by: Kaushik Ram Sadagopan <kaushikram2811@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124193
Approved by: https://github.com/rohan-varma
Gets rid of the following warning:
```
/Users/shenke/workspace/pytorch/test/test_mps.py:9229: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
if base.storage().data_ptr() != other.storage().data_ptr():
```
(noticed while looking at https://github.com/pytorch/pytorch/issues/96153#issuecomment-2101876484 )
Respective change to view ops was landed back in 2022, see https://github.com/pytorch/pytorch/pull/91414
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125838
Approved by: https://github.com/albanD
**Summary**
Per the discussion in https://github.com/pytorch/pytorch/pull/123444, the `decomposed quant/dequant` patterns changed after https://github.com/pytorch/pytorch/pull/123445. We can move the optimization of `decomposed quant/dequant` from inductor decomposition into the lowering phase to avoid those changes. In this way, we can:
- Avoid the pattern matcher failure introduced in https://github.com/pytorch/pytorch/pull/123445
- Make the quantization pattern clearer in the pattern matcher phase, since the `quant/dequant` nodes have not been decomposed.
**Changes in this PR**
- Move optimization of `decomposed quant/dequant` from inductor decomposition into lowering phase.
- Corresponding changes in the quantization pattern matcher to ensure nothing is BC-breaking.
**TestPlan**
```
python -u -m pytest -s -v test/inductor/test_mkldnn_pattern_matcher.py -k test_q
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124041
Approved by: https://github.com/peterbell10, https://github.com/jgong5
Summary:
It seems like the required functions are not available due to the `_MSC_VER` guard. Does anyone have more context on why this functionality has been disabled for Windows?
I'm also unsure how this currently compiles in OSS land on Windows, as there doesn't seem to be any preprocessor protection around `scaled_gemm` getting pulled in.
Test Plan:
Fix compilation errors like this
```
C:\open\fbsource\xplat\caffe2\aten\src\ATen\cuda\tunable\TunableGemm.h(74): error C2039: 'scaled_gemm': is not a member of 'at::cuda::blas'
C:\open\fbsource\xplat\caffe2\aten\src\ATen\cuda\CUDABlas.h(19): note: see declaration of 'at::cuda::blas'
C:\open\fbsource\xplat\caffe2\aten\src\ATen\cuda\tunable\TunableGemm.h(74): note: the template instantiation context (the oldest one first) is
C:\open\fbsource\xplat\caffe2\aten\src\ATen\cuda\tunable\TunableGemm.h(71): note: while compiling class template 'at::cuda::tunable::DefaultScaledGemmOp'
Action failed: fbsource//xplat/caffe2:ATen_cuda_lib_ovrsource (cxx_compile aten/src/ATen/native/cuda/Blas.cpp)
```
Differential Revision: D57087985
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125792
Approved by: https://github.com/malfet, https://github.com/eqy
Summary:
The macros that build `c10::Error` compute the stack trace at the point of throwing, which is then returned as part of the `what()`. If `what()` is never called, which is the case for most exceptions (since logging is throttled), the cost of computing the stack trace was wasted.
By far, the most expensive part of computing the stack trace is its symbolization; just unwinding the stack and collecting the instruction addresses is comparatively cheap. We can thus defer the symbolization to first invocation of `what()`.
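A Python sketch of the pattern (the actual change lives in c10's C++ `Error` class): capture the cheap raw frame information eagerly, and format it only on the first `what()` call.
```
import sys

class LazyTraceError(Exception):
    def __init__(self, msg):
        super().__init__(msg)
        # cheap part: record (code object, line number) pairs for each frame
        self._raw_frames = []
        frame = sys._getframe(1)
        while frame is not None:
            self._raw_frames.append((frame.f_code, frame.f_lineno))
            frame = frame.f_back
        self._symbolized = None

    def what(self):
        # expensive formatting happens at most once, and only if requested
        if self._symbolized is None:
            self._symbolized = "\n".join(
                f"  {code.co_filename}:{lineno} in {code.co_name}"
                for code, lineno in self._raw_frames
            )
        return f"{self}\n{self._symbolized}"
```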
Test Plan:
Added unit tests exercising the lazy nature of `what()`.
Ran an adfinder canary: https://www.internalfb.com/intern/ads/canary/460118801509424346
We can see that the cost of symbolization is obliterated (meaning that `what()` is virtually never called, as expected):
{F1496627896}
Differential Revision: D57128632
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125787
Approved by: https://github.com/huydhn
Enables LRScheduler to handle tensor LRs.
Note on test changes:
For the test modifications I just removed itertools.product and created two loops. This allows us to create a new set of optim_inputs on each iteration to prevent mutations on the tensor LR carrying over across iterations. Nothing else in those tests was modified.
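A usage sketch of what this enables (assumes a torch build containing this change): a scalar tensor LR now flows through an LRScheduler the same way a float LR does.
```
import torch

model = torch.nn.Linear(4, 4)
opt = torch.optim.SGD(model.parameters(), lr=torch.tensor(0.1))
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=10, gamma=0.5)

for _ in range(20):
    opt.zero_grad()
    model(torch.randn(2, 4)).sum().backward()
    opt.step()
    sched.step()

print(opt.param_groups[0]["lr"])  # still a tensor, decayed by gamma every 10 steps
```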
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123753
Approved by: https://github.com/janeyx99
ghstack dependencies: #123751, #123752
This PR fixes `torch.backends.xeon.run_cpu` behavior when it is launched from `torchrun` with the `--nproc-per-node` parameter.
As a CPU launcher, `run_cpu` binds cores to each instance it launches using `numactl`, assigning cores to each instance evenly.
However, if we use `torchrun` with `--nproc-per-node` to start multiple `run_cpu` processes, each `run_cpu` process assumes it can use all the CPU cores, which causes the processes to compete for CPU cores and results in poor performance.
This PR recognizes the environment variables `LOCAL_WORLD_SIZE` and `LOCAL_RANK` set by `torchrun`, then uses this information to further shard the cores bound to each instance. With this PR, when launched by `torchrun --nproc-per-node ...`, different CPU cores will be bound to different workers, which maximizes CPU utilization and application performance.
The specific use case this PR enables is using TorchServe with DeepSpeed tensor parallelism. In this case, TorchServe runs `torchrun --nproc-per-node <tp_size>` to start the tensor parallel workers it needs. When running TorchServe on a multi-socket CPU server with DeepSpeed tensor parallelism, we need this PR to achieve the best performance.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123711
Approved by: https://github.com/jingxu10, https://github.com/ezyang
This PR adds the autotuning infrastructure for CPU. It generalizes and extends `BenchmarkRequest` with CPU support and a C++ module loader. A `do_bench_cpu` util function is added for benchmarking functions on CPU; it runs warmups and returns the median over multiple trials.
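A sketch of what such a helper does (parameter names and defaults are illustrative, not the exact Inductor API): warm up, time several trials, return the median.
```
import time
import statistics

def bench_cpu_median_ms(fn, warmup=5, trials=20):
    for _ in range(warmup):
        fn()
    times_ms = []
    for _ in range(trials):
        start = time.perf_counter()
        fn()
        times_ms.append((time.perf_counter() - start) * 1e3)
    return statistics.median(times_ms)
```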
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125159
Approved by: https://github.com/jansel
Summary:
github issue: https://github.com/pytorch/pytorch/issues/73828
Whenever we transition from RECORD_AND_SAVE to WARMUP in the profiler schedule, we instantiate a new backend profiler which wipes out the last cycle's information. This makes using the `repeat` parameter less useful in the schedule as you only get contents of the last cycle/repeat. In this diff, we save the accumulated Function Events before setting the new ones and then merge the two EventLists after post processing/cleaning is done. This diff only fixes Function Events so that we can get statistics over each cycle within a schedule. A follow up should be made to accumulate the chrome tracings as well if it is requested.
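A sketch of the scenario being fixed (assumes a build containing this change): with `repeat > 1` in the schedule, function events from earlier cycles should now be retained instead of being wiped at each WARMUP transition.
```
import torch
from torch.profiler import ProfilerActivity, profile, schedule

sched = schedule(wait=1, warmup=1, active=2, repeat=2)
with profile(activities=[ProfilerActivity.CPU], schedule=sched) as prof:
    for _ in range(8):  # enough steps to cover two full cycles
        torch.randn(128, 128) @ torch.randn(128, 128)
        prof.step()

# key_averages() should now reflect events from both cycles, not just the last one
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```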
Test Plan: Added functional python tests in test_profiler.py that test different schedules and their FunctionEvent counts
Differential Revision: D56956245
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125510
Approved by: https://github.com/aaronenyeshi
```
python test/test_fx.py -k test_public_api_surface
```
was failing with a complaint about infinite recursion. Fixed that and then marked the two API changes from #123681 as private (for `get_example_value`) and backward compatible (for `insert_deferred_runtime_asserts`).
Fixes #104012
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125706
Approved by: https://github.com/BoyuanFeng
## Problem this PR resolves
Today, most of distributed tests are arranged like this:
```
def test_allreduce(self):
pg = self._create_process_group_nccl(store, self.opts())
pg.allreduce(tensor)
...
```
Thus, we are paying PG creation time **per test**. That's bad. But why were we doing that? Is there a constraint?
If we look deeper, we would find that most of our test cases inherit from `torch.testing._internal.common_distributed.MultiProcessTestCase`. From the name, nothing seems wrong, and probably fits distributed well. But a "problem" exists in its `setUp()` and `tearDown()` methods, which basically do the following:
```
def setUp(self):
self._spawn_processes()
def tearDown(self):
for p in self.processes:
p.terminate()
```
Since `setUp` and `tearDown` are **"test-scope fixtures"**, meaning they are called per test, each test will have brand new processes. Of course we'd have to recreate ProcessGroup every time.
## How we are fixing it
First, obviously, we need to put a PG's lifetime into a longer scope. Python `unittest` provides such a helper, called **"class-scope fixtures."** It is embodied by a `setUpClass` method and a `tearDownClass` method (note the name difference), which are called only once for all tests in the same test class. Therefore, we would do:
```
@classmethod
def setUpClass(cls):
    dist.init_process_group(...)

@classmethod
def tearDownClass(cls):
    dist.destroy_process_group()
```
**In this PR, we create a new test template for distributed: `MultiProcContinousTest`, to hold this class-scope fixture.**
Second, we'd need to avoid per-test process spawn and terminate. That's easy, we can either:
1. launch the whole test file with `torchrun --nproc-per-node=...` or
2. use `mp.spawn()` under `if __name__ == "__main__":`.
Point is, launch the processes only once.
## Result
We moved the "positive tests" from test_c10d_nccl.py to test_c10d_ops_nccl.py.
Before this PR:
```
$ python test_c10d_nccl.py -k ProcessGroupNCCLTest
Ran 24 tests in 174.457s
```
After this PR:
```
$ torchrun --nproc-per-node 2 test_c10d_ops_nccl.py
or
$ python test_c10d_ops_nccl.py
Ran 24 tests in 16.247s
```
10X speedup.
## Limitation
For tests intended to test destroy or abort of PGs, we'd need to go back to the old style. So it would make sense to divide our tests into two classes: one for positive tests where we would reuse the PGs, and the other one for abort/destroy and negative tests like watchdog timeout.
## Next step
Migrate the tests of distributed that would fit with this test style!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125648
Approved by: https://github.com/wconstab
I am building PyTorch with the Intel oneAPI 2024.0 compiler and without cuSparseLt, and encountered various type errors of the following forms:
```
[ 63%] Building CUDA object caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu.o
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/ComputeSparseTile.h(87): error: no operator "=" matches these operands
operand types are: cutlass::uint2b_t = int
detected during:
instantiation of "at::native::Indices4x4 at::native::LargestValuesGreedy<Op>::operator()(Tile4x4Accessor) [with Op=at::native::IdentityOp, Tile4x4Accessor=at::native::KernelTypes<cutlass::half_t>::Tile4x4Accessor]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredPack.h(349): here
instantiation of "void at::native::KernelTypes<Element_>::sparse_semi_structured_tile_kernel(at::native::KernelTypes<Element_>::Params, MetadataStore, Algorithm) [with Element_=cutlass::half_t, Algorithm=at::native::LargestValuesGreedy<at::native::IdentityOp>, MetadataStore=at::native::MetadataCutlass]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(201): here
instantiation of "void at::native::sparse_semi_structured_tile_kernel<KT,Metadata,Algorithm>(KT::Params, Metadata, Algorithm) [with KT=at::native::KernelTypes<cutlass::half_t>, Metadata=at::native::MetadataCutlass, Algorithm=at::native::LargestValuesGreedy<at::native::IdentityOp>]"
(177): here
instantiation of "void at::native::named_algorithms(T) [with T=lambda [](auto, const std::__cxx11::string &)->auto]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(265): here
instantiation of "std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor> at::native::sparse_semi_structured_tile_typed<Element,MetadataFormat>(at::Tensor, std::__cxx11::string) [with Element=cutlass::half_t, MetadataFormat=at::native::MetadataCutlass]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(293): here
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/ComputeSparseTile.h(88): error: no operator "=" matches these operands
operand types are: cutlass::uint2b_t = int
detected during:
instantiation of "at::native::Indices4x4 at::native::LargestValuesGreedy<Op>::operator()(Tile4x4Accessor) [with Op=at::native::IdentityOp, Tile4x4Accessor=at::native::KernelTypes<cutlass::half_t>::Tile4x4Accessor]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredPack.h(349): here
instantiation of "void at::native::KernelTypes<Element_>::sparse_semi_structured_tile_kernel(at::native::KernelTypes<Element_>::Params, MetadataStore, Algorithm) [with Element_=cutlass::half_t, Algorithm=at::native::LargestValuesGreedy<at::native::IdentityOp>, MetadataStore=at::native::MetadataCutlass]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(201): here
instantiation of "void at::native::sparse_semi_structured_tile_kernel<KT,Metadata,Algorithm>(KT::Params, Metadata, Algorithm) [with KT=at::native::KernelTypes<cutlass::half_t>, Metadata=at::native::MetadataCutlass, Algorithm=at::native::LargestValuesGreedy<at::native::IdentityOp>]"
(177): here
instantiation of "void at::native::named_algorithms(T) [with T=lambda [](auto, const std::__cxx11::string &)->auto]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(265): here
instantiation of "std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor> at::native::sparse_semi_structured_tile_typed<Element,MetadataFormat>(at::Tensor, std::__cxx11::string) [with Element=cutlass::half_t, MetadataFormat=at::native::MetadataCutlass]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(293): here
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredPack.h(238): error: function "lambda [](cutlass::uint2b_t, cutlass::uint2b_t)->void" cannot be called with the given argument list
argument types are: (int, int)
object type is: lambda [](cutlass::uint2b_t, cutlass::uint2b_t)->void
detected during:
instantiation of "at::native::KernelTypes<Element_>::Tile4x4Packed at::native::KernelTypes<Element_>::pack_4x4(at::native::Indices4x4, at::native::KernelTypes<Element_>::Tile4x4Accessor, uint32_t &, int, __nv_bool) [with Element_=cutlass::half_t]"
(354): here
instantiation of "void at::native::KernelTypes<Element_>::sparse_semi_structured_tile_kernel(at::native::KernelTypes<Element_>::Params, MetadataStore, Algorithm) [with Element_=cutlass::half_t, Algorithm=at::native::LargestValuesGreedy<at::native::IdentityOp>, MetadataStore=at::native::MetadataCutlass]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(201): here
instantiation of "void at::native::sparse_semi_structured_tile_kernel<KT,Metadata,Algorithm>(KT::Params, Metadata, Algorithm) [with KT=at::native::KernelTypes<cutlass::half_t>, Metadata=at::native::MetadataCutlass, Algorithm=at::native::LargestValuesGreedy<at::native::IdentityOp>]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/ComputeSparseTile.h(177): here
instantiation of "void at::native::named_algorithms(T) [with T=lambda [](auto, const std::__cxx11::string &)->auto]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(265): here
instantiation of "std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor> at::native::sparse_semi_structured_tile_typed<Element,MetadataFormat>(at::Tensor, std::__cxx11::string) [with Element=cutlass::half_t, MetadataFormat=at::native::MetadataCutlass]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(293): here
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredPack.h(241): error: function "lambda [](cutlass::uint2b_t, cutlass::uint2b_t)->void" cannot be called with the given argument list
argument types are: (int, int)
object type is: lambda [](cutlass::uint2b_t, cutlass::uint2b_t)->void
detected during:
instantiation of "at::native::KernelTypes<Element_>::Tile4x4Packed at::native::KernelTypes<Element_>::pack_4x4(at::native::Indices4x4, at::native::KernelTypes<Element_>::Tile4x4Accessor, uint32_t &, int, __nv_bool) [with Element_=cutlass::half_t]"
(354): here
instantiation of "void at::native::KernelTypes<Element_>::sparse_semi_structured_tile_kernel(at::native::KernelTypes<Element_>::Params, MetadataStore, Algorithm) [with Element_=cutlass::half_t, Algorithm=at::native::LargestValuesGreedy<at::native::IdentityOp>, MetadataStore=at::native::MetadataCutlass]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(201): here
instantiation of "void at::native::sparse_semi_structured_tile_kernel<KT,Metadata,Algorithm>(KT::Params, Metadata, Algorithm) [with KT=at::native::KernelTypes<cutlass::half_t>, Metadata=at::native::MetadataCutlass, Algorithm=at::native::LargestValuesGreedy<at::native::IdentityOp>]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/ComputeSparseTile.h(177): here
instantiation of "void at::native::named_algorithms(T) [with T=lambda [](auto, const std::__cxx11::string &)->auto]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(265): here
instantiation of "std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor> at::native::sparse_semi_structured_tile_typed<Element,MetadataFormat>(at::Tensor, std::__cxx11::string) [with Element=cutlass::half_t, MetadataFormat=at::native::MetadataCutlass]"
/tmp/pytorch/aten/src/ATen/native/sparse/cuda/SparseSemiStructuredTile.cu(293): here
```
The casts added by this PR get the build working again for me.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124777
Approved by: https://github.com/jcaip
- `FakeContext` hides all fields other than `ctx.saved_tensors`; this makes dynamo error out when autograd.Function.backward uses other attrs on ctx, and it also doesn't allow falling back to eager.
- If we remove it, we still can't fall back to eager: node variables are already freed (ctx.saved_tensors throws)
- However, we can fall back to "pseudo-eager" by using a duck-typed ctx and routing ctx.saved_tensors to the lifted tensors
- Dynamo tries to inline external_utils.call_backward, and treats BackwardCFunction as an AutogradFunctionContextVariable (only used up until we create the fake context: FakeBackwardCFunction)
- we call_function backward from the forward class AutogradFunctionVariable, and we still pass in the fake context as a UserDefinedObjectVariable (can later use AutogradFunctionContextVariable + HOO graph speculate)
Fixes #125489, #124827
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125661
Approved by: https://github.com/jansel
Summary:
Update the Kineto submodule in PyTorch. The following diffs are included:
- Removed CUPTI overhead track in AMD traces
- Delay logging for CUDA stream wait event until the end
- Changed chrome trace units to milliseconds, while the data will be in ns
- Refactored roctracer to include metadata and improved names.
- Lowered Kineto Stage log level, reducing noisy output
- Changed relative time of ts to quarterly interval for distributed trace alignment
- Fixed Non-risky deprecated use of 0/NULL
- Removed hardcoding of /opt/rocm
- Handling cuLaunchKernelEx better
- Fixed Non-risky missing field initializers and unused variables.
Test Plan: CI and this is running internally.
Differential Revision: D57011897
Pulled By: aaronenyeshi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125621
Approved by: https://github.com/sraikund16
As per title.
This ensures that the methods defined in _tensor.py do exist in all the places where we assume they do.
BC-Breaking: This is bc-breaking as the user cannot subclass this private class anymore.
You should replace any use of _TensorBase with Tensor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125558
Approved by: https://github.com/ezyang
Summary:
block traverse mode:
Assumption:
the culprits form a block (start_idx, end_idx) in the topologically sorted graph,
and the error will go away if the graph pattern is broken
Reviewed By: junhanh
Differential Revision: D56799587
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125613
Approved by: https://github.com/jfix71
In case the `dfs_iter_find_cycle` function receives duplicated node entries in the `all_user_nodes` argument, it will still process each one of them. This commit changes the `all_user_nodes` list into a set, so each element is unique, resulting in a shorter execution time of the `propose_partitions` function.
Fixes #125584
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125585
Approved by: https://github.com/Skylion007
Summary:
By default, some inferred dynamic shapes guards/constraints that are not expressible with the current dynamic shapes language will lead to specialization to the concrete input values provided. If disable_forced_specializations is set to True, we will not specialize, and will not perform runtime checks on such produced guards. Instead, we allow the user to specify arbitrary shapes, and fail during runtime if the inputs are invalid. Constraints expressible with the language (e.g. ranges, linear derived dims) will still be enforced, and behavior for all other guards remains the same.
Cases where we typically specialize are reshapes:
```
x: [4, 6] # [s0, s1]
x = x.reshape([x.shape[0] - 1, -1])
# this emits a guard Mod(s0*s1, s0-1) = 0, we specialize on s0=4, s1=6
x: [4, 6], y: [24] # [s0, s1], [s2]
x = x.reshape([-1]) + y
# this emits a guard s0*s1 = s2, we specialize on s0=4, s1=6, s2=24
```
For now only applicable for non-strict mode (need to figure out how to pass this flag into dynamo's call of produce_guards).
Test Plan: Added test case that checks compilation, runtime, and suggested fixes behavior.
Differential Revision: D56361177
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124949
Approved by: https://github.com/avikchaudhuri
Now that efficient attention is supported in dtensor, we can modify the transformer test to use dtensor in SDPA and get rid of the manual num_head adjustments.
Caveat: Efficient attention is supported only with bf16/fp32 (not fp64) and has other constraints. If any of the constraints are not satisfied, the SDPA would fall back to the math decomposed attention, which will break as it does not fully work with dtensor (it creates a `torch.Tensor` mask in the middle). I considered adding some checks like in P1202254918 but that needs to be added everywhere this Transformer is used. Is it necessary if the current CI machines can run efficient attention?
Test files containing this Transformer:
- `test/distributed/tensor/parallel/test_tp_examples.py`
- `test/distributed/_composable/fsdp/test_fully_shard_training.py`
- `test/distributed/_composable/fsdp/test_fully_shard_clip_grad_norm_.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122997
Approved by: https://github.com/XilunWu
ghstack dependencies: #122995, #122996
Previously, the new tensors out of the "new factory" ops all became replicated.
With this PR, if the new tensor has the same shape as the old tensor **and** the shape can be evenly sharded, then the old spec is inherited and preferred.
To accommodate this when the old tensor has sharded placements, the input args for local computation (size, stride) need to be adjusted.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122995
Approved by: https://github.com/wanchaol
@wanchaol was seeing the loss eventually become NaN when compiling individual transformer blocks in torchtitan - with this patch I no longer see the NaN loss.
The problem is the following:
(1) It is possible to have graph inputs to a compiled region that are AsyncCollectiveTensors. In particular: when we compile individual transformer blocks in the llama model, the first layer (embedding layer) is run in eager mode, and it outputs an AsyncCollectiveTensor that is fed to the first transformer block
(2) ideally, we would like that AsyncCollectiveTensor graph input to desugar into a `wait_tensor()` op that shows up at the beginning of the graph.
(3) the way this is supposed to happen is: AOTAutograd traces through the `__torch_dispatch__` of AsyncCollectiveTensor, tracing out a `wait_tensor()` call before dispatching to any of the other ops in the function we are tracing
(4) however: `trigger_wait()` was getting called in a way where we would ignore its output (and return `self.elem` directly), which would cause the `wait_tensor` ops to get DCE'd.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125677
Approved by: https://github.com/wanchaol, https://github.com/yifuwang
ghstack dependencies: #125676
This is the first PR in a series where I try to organize our runtime wrappers a bit: specifically, I'd like to separate wrappers into objects that have (up to) 2 methods:
- A **pre-compile** function, which takes in flat_fn and flat_args (inputs to the compiler) and wraps/modifies them
- A **post-compile** function, which takes in a compiled_fn and runtime args and wraps the compiled function.
Extra metadata necessary to run the compile functions can be stored on the attributes of the class. This way, when we think about caching, the set of attributes on the class should be the exact set of metadata that we need to serialize and save in the cache (along with common data, like fw_metadata)
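A minimal sketch of that wrapper shape (a hypothetical wrapper, not the actual AOTAutograd classes): any metadata the runtime step needs is stored on the instance, which is exactly what a cache would have to serialize.
```
class ArgCountWrapper:
    """Toy wrapper: record metadata at pre-compile time, check it at runtime."""

    def pre_compile(self, flat_fn, flat_args):
        # metadata needed later is stored on self
        self.num_args = len(flat_args)

        def wrapped_fn(*args):
            return flat_fn(*args)

        return wrapped_fn, flat_args

    def post_compile(self, compiled_fn):
        expected = self.num_args

        def runtime_wrapper(*runtime_args):
            assert len(runtime_args) == expected, "unexpected number of runtime args"
            return compiled_fn(*runtime_args)

        return runtime_wrapper
```
Usage would be: call `pre_compile` on the compiler inputs, compile the returned function, then wrap the compiled artifact with `post_compile`.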
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125595
Approved by: https://github.com/bdhirsh
# Motivation
As discussed in [#124479](https://github.com/pytorch/pytorch/pull/124479), `torch.amp.autocast` can NOT be completely equivalent to `torch.cuda.amp.autocast` and `torch.cpu.amp.autocast`, since `torch.amp.autocast` does NOT have a default `dtype` for CPU (`torch.bfloat16` by default) and CUDA (`torch.float16` by default) respectively. We would like `torch.amp.autocast` to be more generic to help developers/customers write device-agnostic code, because there are not enough reasons to add a device-specific autocast `torch.xxx.amp.autocast` for each device backend.
# Solution
When `None` is passed to `dtype`, we should use `torch.get_autocast_dtype` to get the related dtype for each backend. Meanwhile, `torch.get_autocast_dtype` is necessary to be supported in JIT path for BC.
# Additional Context
With this PR, `torch.amp.autocast(device_type='cuda')` is equivalent to `torch.cuda.amp.autocast`.
Add two new UTs to cover this change in eager and jit path respectively.
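A usage sketch of the device-agnostic behavior this enables: with `dtype` left unset, each backend picks up its own default autocast dtype (bfloat16 on CPU, float16 on CUDA).
```
import torch

def forward_autocast(model, x, device_type):
    # dtype left as None -> torch.get_autocast_dtype(device_type) is used per backend
    with torch.amp.autocast(device_type=device_type):
        return model(x)

model = torch.nn.Linear(8, 8)
out = forward_autocast(model, torch.randn(2, 8), "cpu")
print(out.dtype)  # torch.bfloat16 on CPU; float16 if run with device_type="cuda"
```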
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125103
Approved by: https://github.com/albanD, https://github.com/jgong5, https://github.com/gujinghui
Summary:
The macros that build `c10::Error` compute the stack trace at the point of throwing, which is then returned as part of the `what()`. If `what()` is never called, which is the case for most exceptions (since logging is throttled), the cost of computing the stack trace was wasted.
By far, the most expensive part of computing the stack trace is its symbolization; just unwinding the stack and collecting the instruction addresses is comparatively cheap. We can thus defer the symbolization to first invocation of `what()`.
Test Plan:
Added unit tests exercising the lazy nature of `what()`.
Ran an adfinder canary: https://www.internalfb.com/intern/ads/canary/460118801509424346
We can see that the cost of symbolization is obliterated (meaning that `what()` is virtually never called, as expected):
{F1496627896}
Reviewed By: ezyang
Differential Revision: D56586844
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125682
Approved by: https://github.com/ezyang
As titled: for meta tensor ops, we should avoid calling the RNGTracker,
which could potentially alter the current RNG state. Meta tensor ops
should be no-ops, and it is the post-`to_empty` init that should actually
alter the RNG state.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125693
Approved by: https://github.com/XilunWu
Fixes #125272
## About
(This is a re-spin of PR #106617)
Kineto introduced a new profiler to read performance counters from NVIDIA GPUs (CUPTI Range Profiler API), added in PR [75616](https://github.com/pytorch/pytorch/pull/75616). Support for the range profiler mode was disabled as we had to link with an NV PerfWorks library (`libnvperf_host.so`). This PR adds that link.
The change includes-
* Updates cmake build files to find `libnvperf_host.so` and set `CUDA_nvperf_host_LIBRARY`
* WIP use the above cmake variable in kineto, will update this PR after kineto PR has landed
See https://github.com/pytorch/kineto/pull/724
## Example usage of CUPTI profiler
The code snippet below shows how to configure the PyTorch profiler in CUPTI Profiler mode. Any code included in the profiling window will be profiled by CUPTI/Kineto. Note how the `_ExperimentalConfig` struct is used to configure profiler metrics.
```
with torch.profiler.profile(
activities=[torch.profiler.ProfilerActivity.CUDA],
record_shapes=True,
on_trace_ready=trace_handler,
experimental_config=torch.profiler._ExperimentalConfig(
profiler_metrics=[
"kineto__tensor_core_insts",
"dram__bytes_read.sum",
"dram__bytes_write.sum"],
profiler_measure_per_kernel=False),
) as prof:
res = train_batch(modeldef)
prof.step()
```
For a full example see this [xor.py](https://gist.github.com/briancoutinho/b1ec7919d8ea2bf1f019b4f4cd50ea80) gist.
### Details of how to configure the CUPTI profiler
The `_ExperimentalConfig` structure can be used to pass metrics to the profiler:
```
profiler_metrics : a list of CUPTI profiler metrics used
to measure GPU performance events. Any metric supported by CUPTI can be used, see:
https://docs.nvidia.com/cupti/r_main.html#r_profiler
There are two special alias metrics `kineto__tensor_core_insts` and `kineto__cuda_core_flops` for FLOPS counting.
profiler_measure_per_kernel (bool) : whether to profile metrics per kernel
or for the entire measurement duration.
```
## Testing
Built from source with kineto [PR](https://github.com/pytorch/kineto/pull/724)
```
$> USE_CUDA=1 python setup.py install
-- CUDA_cupti_LIBRARY = /public/apps/cuda/11.6/extras/CUPTI/lib64/libcupti.so
-- CUDA_nvperf_host_LIBRARY = /public/apps/cuda/11.6/extras/CUPTI/lib64/libnvperf_host.so
```
Then run the example [xor.py](https://gist.github.com/briancoutinho/b1ec7919d8ea2bf1f019b4f4cd50ea80). This only works on V100+ GPUs. Debug logging can be enabled as follows:
```
>$ export KINETO_LOG_LEVEL=1
>$ python xor.py
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiActivityProfiler.cpp:167] CUDA versions. CUPTI: 16; Runtime: 11060; Driver: 11040
Log file: /tmp/libkineto_activities_1683060.json
Trace start time: 2023-02-11 19:11:47 Trace duration: 500ms
Warmup duration: 0s
Max GPU buffer size: 128MB
Enabled activities: cuda_profiler_range
Cupti Profiler metrics : kineto__tensor_core_insts, dram__bytes_read.sum, dram__bytes_write.sum
Cupti Profiler measure per kernel : 0
Cupti Profiler max ranges : 10
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiActivityProfiler.cpp:638] Enabling GPU tracing
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiActivityProfiler.cpp:567] Running child profiler CuptiRangeProfiler for 500 ms
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiRangeProfiler.cpp:104] Configuring 3 CUPTI metrics
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiRangeProfiler.cpp:109] sm__inst_executed_pipe_tensor.sum
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiRangeProfiler.cpp:109] dram__bytes_read.sum
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiRangeProfiler.cpp:109] dram__bytes_write.sum
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiActivityProfiler.cpp:575] Running child profiler CuptiRangeProfiler for 500 ms
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiActivityProfiler.cpp:672] Tracing starting in 9s
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiActivityProfiler.cpp:677] Tracing will end in 10s
STAGE:2023-02-11 19:11:37 1683060:1683060 ActivityProfilerController.cpp:310] Completed Stage: Warm Up
INFO:2023-02-11 19:11:37 1683060:1683060 CuptiActivityProfiler.cpp:693] Starting child profiler session
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125685
Approved by: https://github.com/sraikund16
Fixes #125526 and [pytorch/builder#1811](https://github.com/pytorch/builder/issues/1811).
Adopt `syntax=docker/dockerfile:1`, which has been stable since 2018 and is still best practice to declare in 2024.
- Syntax features dependent upon the [syntax directive version are documented here](https://hub.docker.com/r/docker/dockerfile).
- While you can set a fixed minor version, [Docker officially advises to only pin the major version](https://docs.docker.com/build/dockerfile/frontend/#stable-channel):
```
We recommend using docker/dockerfile:1, which always points to the latest stable release of the version 1 syntax, and receives both "minor" and "patch" updates for the version 1 release cycle.
BuildKit automatically checks for updates of the syntax when performing a build, making sure you are using the most current version.
```
**Support for building with Docker prior to v23 (released in Feb 2023)**
NOTE: 18.06 may not be the accurate minimum version for using docker/dockerfile:1; according to the [DockerHub tag history](https://hub.docker.com/layers/docker/dockerfile/1.0/images/sha256-92f5351b2fca8f7e2f452aa9aec1c34213cdd2702ca92414eee6466fab21814a?context=explore), 1.0 of the syntax seems to be from Dec 2018, which is probably why docker/dockerfile:experimental was paired with it in this file.
Personally, I'd favor only supporting builds with Docker v23. This is only relevant for someone building this Dockerfile locally, the user could still extend the already built and published image from a registry on older versions of Docker without any concern for this directive which only applies to building this Dockerfile, not images that extend it.
However, if you're reluctant, you may want to refer others to [this Docker docs page](https://docs.docker.com/build/buildkit/#getting-started), where they should only need to set the env var `DOCKER_BUILDKIT=1`; presumably the requirement for experimental was dropped with syntax=docker/dockerfile:1 in releases of Docker since Dec 2018. Affected users can often quite easily install a newer version of Docker on their OS, per Docker's official guidance (usually by including an additional repo in the package manager).
**Reference links**
Since one of these was already included in the inline note (now a broken link), I've included relevant links mentioned above. You could alternatively rely on git blame with a commit message referencing the links or this PR for more information.
Feel free to remove any of the reference links, they're mostly only relevant to maintainers to be aware of (which this PR itself has detailed adequately above).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125632
Approved by: https://github.com/malfet
This PR seeks to increase observability of save/load requests. This is accomplished with two main changes:
1. The creation of save_id and load_id:
- a save_id and load_id are added to the filesystem writer. `save_id` is re-generated on every save call, and `load_id` is re-generated on every load call.
- both these IDs are stored in a new `StorageMeta` class and saved as part of Metadata. (`load_id` is None when we save, and only set during load)
2. A new mechanism is implemented in the save path which gives the SavePlanner a chance to inspect the `storage_meta` object. The mechanism mirrors the same metadata exchange in the load path. In the load path, `storage_meta` is added to `metadata` such that the LoadPlanner can also access `storage_meta` before we begin loading.
*If users now wish to access the checkpoint_id in the SavePlanner, they simply need to access the value in `storage_meta` from the `set_up_planner` call.*
*Additionally, users now have a generic way of passing data to the SavePlanner from the StorageWriter at the start of the save path, similar to the load path*
This PR has been tested for backwards compatibility -- meaning any checkpoints saved before this PR can continue being loaded after this PR.
One major consideration is that there is limited forwards compatibility. If a checkpoint is generated _past_ this PR, there is no support for loading it using older torch versions. This brings up a fairly important point: since we expect the metadata object (which is saved to the disk) to continue evolving, and we want to support forwards compatibility, we explore patching `pickle` so we can at least add new members to `metadata` and maintain fwd compat.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124772
Approved by: https://github.com/fegin
Fixes [internal error](https://fb.workplace.com/groups/1075192433118967/permalink/1416709435633930/).
The issue is that the asserting nodes added in the `insert_deferred_runtime_assertion` pass do not contain metadata that the ExportedProgram requires the graph to have. One solution is to retrace the entire module; another is to manually add back this metadata.
This diff implements the latter solution (manually add back the metadata) through hooking into fx.graph's `create_node` function, and adding export-specific metadata for every node that is created. The reason I did this is so that the `insert_deferred_runtime_assertion` does not have to know about what metadata export wants.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125414
Approved by: https://github.com/zhxchen17, https://github.com/BoyuanFeng
Summary: capture_pre_autograd_graph is deprecated and torch.export won't be able to provide timely fixes for this API. To reduce some confusion around this, we should explicitly give users clear warnings.
Test Plan: eyes
Reviewed By: tarun292
Differential Revision: D56955202
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125602
Approved by: https://github.com/angelayi
By defining `CASE_ISSIGNED` macros that just return `std::numeric_limits<dtype>::is_signed` for the types where it makes sense, and explicitly coding the cases where it does not.
Remove the `default:` case from the switch to avoid regressions like the one reported in https://github.com/pytorch/pytorch/issues/125124, as [`-Wswitch-enum`](https://clang.llvm.org/docs/DiagnosticsReference.html#wswitch-enum) in combination with `-Werror` will raise an error in case of a missing entry, for example:
```
/Users/nshulga/git/pytorch/pytorch/c10/core/ScalarType.h:518:11: warning: enumeration value 'QInt32' not handled in switch [-Wswitch]
switch (t) {
^
/Users/nshulga/git/pytorch/pytorch/c10/core/ScalarType.h:518:11: note: add missing switch cases
switch (t) {
^
1 warning generated.
```
Fixes https://github.com/pytorch/pytorch/issues/125124
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125637
Approved by: https://github.com/albanD
This resolves a bug in eager where, if an old state dict is loaded (without the capturable flag) but the original dict had the capturable flag, then state_steps would be on CUDA but we would take the non-capturable path. We now fall back to eager if capturable=False.
Current design doc and discussion: https://docs.google.com/document/d/1DmmbiaSp16CDZtGw1qzXKHFTY_0gqc0xpnBdviXq0vk/edit#heading=h.871u7bvwz7ze
Note on the actual fallback logic: there was an issue with torchscript originally not handling `*args, **kwargs` properly. After rectifying that by using `functools.wraps`, there was an additional bug with scoping, which required the single-tensor implementation to be in the global scope at the time the fallback closure was created. I pass the single-tensor function into the `_disable_dynamo_if_unsupported` decorator to work around this bug.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123619
Approved by: https://github.com/janeyx99
Fixes https://github.com/pytorch/pytorch/issues/125109, which is a regression introduced by https://github.com/pytorch/builder/pull/1467 that adds a dynamic dependency on mkl, which, if installed in the user dir, is placed into `sysconfig.get_config_var("userbase") / "Library" / "bin"`.
Fix this by adding the `userbase` folder to the DLL search path.
Testing before this fix:
```
Python 3.12.3 (tags/v3.12.3:f6650f9, Apr 9 2024, 14:05:25) [MSC v.1938 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\Administrator\AppData\Roaming\Python\Python312\site-packages\torch\__init__.py", line 141, in <module>
raise err
OSError: [WinError 126] The specified module could not be found. Error loading "C:\Users\Administrator\AppData\Roaming\Python\Python312\site-packages\torch\lib\shm.dll" or one of its dependencies.
>>> exit()
```
After:
```
c:\Program Files\Python312>python
Python 3.12.3 (tags/v3.12.3:f6650f9, Apr 9 2024, 14:05:25) [MSC v.1938 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> exit()
```
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125684
Approved by: https://github.com/malfet
We save and restore the DynamicLayerStack during frame eval, but since the fx graph has no way to express a try/finally we just assume the restore will happen. If we throw an exception between the push and the pop to the stack, then we're left in a state that affects following operations poorly. Make sure that if it's in a bad state we restore it after frame eval.
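A minimal sketch of the save/restore pattern (function and helper names are hypothetical): snapshot the stack before evaluating the frame and restore it even if evaluation raises.
```
def eval_frame_with_stack_guard(eval_frame, frame, get_stack, set_stack):
    saved = get_stack()
    try:
        return eval_frame(frame)
    finally:
        # if the traced code pushed onto the stack and then raised, undo the damage
        if get_stack() != saved:
            set_stack(saved)
```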
Repro:
before:
```
$ rm test/dynamo_skips/TestSparseCPU.test_log1p_cpu_uint8
$ rm test/dynamo_expected_failures/FuncTorchHigherOrderOpTests.test_vmap_free_tensor
$ PYTORCH_TEST_WITH_DYNAMO=1 pytest test/jit/test_sparse.py test/dynamo/test_dynamic_shapes.py test/inductor/test_torchinductor_dynamic_shapes.py test/test_sparse.py -k 'test_log1p_cpu_uint8'
============= 1 passed, 8588 deselected in 9.75s =============
$ PYTORCH_TEST_WITH_DYNAMO=1 pytest test/jit/test_sparse.py test/dynamo/test_dynamic_shapes.py test/inductor/test_torchinductor_dynamic_shapes.py test/test_sparse.py -k
'test_vmap_free_tensor_dynamic_shapes or test_log1p_cpu_uint8'
================== short test summary info ===================
FAILED [0.0632s] test/test_sparse.py::TestSparseCPU::test_log1p_cpu_uint8 - AssertionError: "only Tensors of floating point dtype can require gradients"
does not match "You are attempting to call Tensor.requires_grad_() (or perhaps using torch.autograd.functional.* APIs) inside of a function ...
======= 1 failed, 1 skipped, 8587 deselected in 10.99s =======
```
(Note that adding test_vmap_free_tensor_dynamic_shapes causes test_vmap_free_tensor_dynamic_shapes to fail)
after:
```
$ rm test/dynamo_skips/TestSparseCPU.test_log1p_cpu_uint8
$ rm test/dynamo_expected_failures/FuncTorchHigherOrderOpTests.test_vmap_free_tensor
$ PYTORCH_TEST_WITH_DYNAMO=1 pytest test/jit/test_sparse.py test/dynamo/test_dynamic_shapes.py test/inductor/test_torchinductor_dynamic_shapes.py test/test_sparse.py -k 'test_log1p_cpu_uint8'
============= 1 passed, 8588 deselected in 9.89s =============
$ PYTORCH_TEST_WITH_DYNAMO=1 pytest test/jit/test_sparse.py test/dynamo/test_dynamic_shapes.py test/inductor/test_torchinductor_dynamic_shapes.py test/test_sparse.py -k
'test_vmap_free_tensor_dynamic_shapes or test_log1p_cpu_uint8'
======= 1 passed, 1 skipped, 8587 deselected in 11.34s =======
```
(test_vmap_free_tensor_dynamic_shapes passes either way)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122074
Approved by: https://github.com/oulgen
Summary:
Right now DCP only flattens a mapping (e.g., a dict) if that mapping has tensor objects. This behavior is odd, as users may save different non-tensor objects on different ranks. Without flattening the mappings, we may lose these non-tensor objects. One use case is the dataloader state_dict.
We may also want to do so for a list/tuple. But this will cause extra pickles. So we don't do this for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125335
Approved by: https://github.com/LucasLLC, https://github.com/wz337
ghstack dependencies: #125333, #125501, #125334
This fixes a logic regression introduced by https://github.com/pytorch/pytorch/pull/123247 where
```python
if self.use_device and self.use_device != _get_privateuse1_backend_name():
```
was replaced with
```python
VALID_DEVICE_OPTIONS = ["cuda", "xpu", "privateuseone"]
if self.use_device not in VALID_DEVICE_OPTIONS:
```
That triggers a warning every time the code is invoked with `self.use_device` set to None.
This change also skips all the checks, which are useless if `use_device` is None to begin with.
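A minimal sketch of the intended check (not necessarily the exact code that landed): skip device validation entirely when `use_device` is None, so no spurious warning fires.
```
VALID_DEVICE_OPTIONS = ["cuda", "xpu", "privateuseone"]

def should_warn_about_device(use_device):
    # None means "no device requested": skip all device checks, as before the regression
    return use_device is not None and use_device not in VALID_DEVICE_OPTIONS
```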
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125654
Approved by: https://github.com/aaronenyeshi
Looking at the unrelated Windows timeout failure on https://github.com/pytorch/pytorch/pull/125199, it looks like we don't have a timeout value set for C++ tests atm. In this case, a C++ test on Windows timed out after 2+ hours.
```
2024-05-02T23:35:34.0639067Z Running cpp/c10_TypeList_test 1/1 ... [2024-05-02 23:35:34.059021]
2024-05-02T23:35:34.0641108Z Executing ['pytest', 'C:\\actions-runner\\_work\\pytorch\\pytorch\\build\\win_tmp\\build\\torch\\test\\c10_TypeList_test.exe', '-m', 'not serial', '-v', '-vv', '-rfEX', '-n', '2', '--junit-xml-reruns', 'test-reports\\python-pytest\\test\\run_test\\test\\run_test-c898ddeff8f33cbf.xml', '-x', '--reruns=2'] ... [2024-05-02 23:35:34.062137]
2024-05-03T02:45:33.7862004Z Process SpawnPoolWorker-2:
2024-05-03T02:45:33.7927201Z Traceback (most recent call last):
2024-05-03T02:45:33.7928032Z File "C:\Jenkins\Miniconda3\lib\multiprocessing\process.py", line 315, in _bootstrap
2024-05-03T02:45:33.7928722Z self.run()
2024-05-03T02:45:33.7929722Z File "C:\Jenkins\Miniconda3\lib\multiprocessing\process.py", line 108, in run
2024-05-03T02:45:33.7931639Z self._target(*self._args, **self._kwargs)
2024-05-03T02:45:33.7932435Z File "C:\Jenkins\Miniconda3\lib\multiprocessing\pool.py", line 114, in worker
2024-05-03T02:45:33.7933338Z task = get()
2024-05-03T02:45:33.7933946Z File "C:\Jenkins\Miniconda3\lib\multiprocessing\queues.py", line 365, in get
2024-05-03T02:45:33.7935219Z res = self._reader.recv_bytes()
2024-05-03T02:45:33.7935897Z File "C:\Jenkins\Miniconda3\lib\multiprocessing\connection.py", line 221, in recv_bytes
2024-05-03T02:45:33.7936609Z buf = self._recv_bytes(maxlength)
2024-05-03T02:45:33.7937302Z File "C:\Jenkins\Miniconda3\lib\multiprocessing\connection.py", line 310, in _recv_bytes
2024-05-03T02:45:33.7938316Z waitres = _winapi.WaitForMultipleObjects(
2024-05-03T02:45:33.7938766Z KeyboardInterrupt
```
Retrying was working, but it was already too late to finish the job. I'm setting the same default `THRESHOLD * 3` timeout value here for C++ tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125517
Approved by: https://github.com/clee2000
1. Fix the incorrect tests about lazy init for a PrivateUse1 backend named foo
2. Refactor the tests and make them more flexible
3. Disable the two tests temporarily
- test_open_device_faketensor
- test_open_device_scalar_type_fallback
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125572
Approved by: https://github.com/albanD
The scheduler searches for fusion opportunities by looking for common memory accesses. Two memory accesses are considered common not only when the buffer names match; more is required:
- index formula matches
- var_ranges matches
In this PR, I want to log all the fusion failures due to mismatched index formulas or var_ranges. I also want to further categorize the failures. Right now I found the following failure categories:
- rand_seed: the index for rand seed access is an integer, and different accesses use different integer offsets
- different numel: this happens for the cat operation
- broadcast: e.g. kernel A writes a buffer which is broadcast and read by kernel B
- different loop orders: the major category we want inductor to be able to fuse
- different offset: happens when a concatenated linear layer is used to project Q/K/V and the result is then split. Each split will point to the same buffer with a different offset.
- unknown
My hope is to make sure that, for the models I tested, there are no fusion failures falling into the unknown category, so all the failures are well understood and categorized. Right now that's true for BertForMaskedLM ( https://gist.github.com/shunting314/6dc2c903629d342fa63ba731a171adc2 ), DistillGPT2 ( https://gist.github.com/shunting314/145176f2e850103c7fad4ad72f0e200e ) and llm.c ( https://gist.github.com/shunting314/cfc64a326312a889ba55f79bd47b2082 )
For BertForMaskedLM, we found 82 instances of fusion failures and the majority of them are due to different loop orders! Studying the log a bit more can help us figure out where all these loop order mismatches come from in real models.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124986
Approved by: https://github.com/eellison, https://github.com/jansel
Check that they are not used by running the following
```
% grep -h "AT_FORALL_SCALAR_TYPES_AND" . -R|grep -v #define|cut -d\( -f1|sort|uniq
AT_FORALL_SCALAR_TYPES_AND3
AT_FORALL_SCALAR_TYPES_AND3
AT_FORALL_SCALAR_TYPES_AND
AT_FORALL_SCALAR_TYPES_AND2
AT_FORALL_SCALAR_TYPES_AND3
AT_FORALL_SCALAR_TYPES_AND7
AT_FORALL_SCALAR_TYPES_AND2
AT_FORALL_SCALAR_TYPES_AND3
AT_FORALL_SCALAR_TYPES_AND7
AT_FORALL_SCALAR_TYPES_AND2
AT_FORALL_SCALAR_TYPES_AND3
AT_FORALL_SCALAR_TYPES_AND7
// AT_FORALL_SCALAR_TYPES / AT_FORALL_SCALAR_TYPES_AND macros below, which are
AT_FORALL_SCALAR_TYPES_AND
AT_FORALL_SCALAR_TYPES_AND2
AT_FORALL_SCALAR_TYPES_AND3
AT_FORALL_SCALAR_TYPES_AND7
using at::Half; // for AT_FORALL_SCALAR_TYPES_AND3
```
or by checking online using https://github.com/search?type=code&q=AT_FORALL_SCALAR_TYPES_AND4+repo%3Apytorch%2Fpytorch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125607
Approved by: https://github.com/albanD
This PR makes libtorch behave the same as PyTorch when loading optimizer state from archive. With PyTorch, options of parameter groups are loaded from the archive, which is missing currently in libtorch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125215
Approved by: https://github.com/janeyx99
Fixes#117850
This PR:
* Adds the class name in the repro command
* Fixes the path to the test file for python 3.8 jobs (apparently `inspect.getfile(class_type)` returns a relative path in this older python version)
Before (in python 3.8):
```sh
PYTORCH_TEST_WITH_DYNAMO=1 python test_autograd.py -k test_foo
```
After:
```sh
PYTORCH_TEST_WITH_DYNAMO=1 python test/test_autograd.py -k TestAutograd.test_foo
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125498
Approved by: https://github.com/huydhn, https://github.com/janeyx99
This diff implements a remote caching strategy (Memcache for internal and Redis for external) for caching the mapping from an Inductor FX graph to the Inductor-generated wrapper file.
It uses the same idea as the autotuning result cache that is currently live.
This will land turned off; before turning it on by default, I will do more testing, including looking at the dynamic shape guards added by inductor.
Differential Revision: [D56441624](https://our.internmc.facebook.com/intern/diff/D56441624/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124669
Approved by: https://github.com/jansel, https://github.com/eellison
Saw a warning like this:
```
/opt/conda/lib/python3.10/site-packages/torch/utils/hooks.py:86: UserWarning: backward hook functools.partial(<function _pre_backward_hook at 0x7f9a3940fac0>, FullyShardedDataParallel(
....
), <torch.distributed.fsdp.flat_param.FlatParamHandle object at 0x7f25202a9720>) on tensor will not be serialized. If this is expected, you can decorate the function with @torch.utils.hooks.unserializable_hook to suppress this warning
```
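The warning's own suggestion is the decorator below; a minimal sketch of it on an unrelated plain tensor hook (not the FSDP hook from the message):
```python
import torch
from torch.utils.hooks import unserializable_hook

# Marking a hook as intentionally non-serializable suppresses the
# "will not be serialized" warning when the tensor it is attached to is saved.
@unserializable_hook
def log_grad(grad):
    print(grad.norm())

t = torch.randn(3, requires_grad=True)
t.register_hook(log_grad)
```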
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125464
Approved by: https://github.com/ezyang
This factors out a repeated pattern of creating a library and fetching a function from source.
Typical use case:
```cpp
static MetalShaderLibrary lib(SHADER_SOURCE);
...
id<MTLComputePipelineState> cplState = lib.getPipelineStateForFunc("kernel_name");
```
- Make it possible to use with templated sources
- Add `scalarToMetalTypeString(const Tensor&)` variant to avoid repeated `scalarToMetalTypeString(t.scalar_type())` calls in the code
I.e. it makes no functional changes, but reduces MPS codebase size by 365 lines
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125550
Approved by: https://github.com/kulinseth
Summary:
This shim exports symbols on Windows, which can lead to symbol clashes at link time in the following scenario:
1. A DLL imports libtorch
2. A binary imports libtorch, and also depends on the DLL in (1)
Under that scenario, the symbols exported from `shim.h` can clash at link time.
Given that AOTInductor only works for PyTorch2, and PyTorch2 doesn't currently work for Windows, we can work around this problem by simply removing the symbol exports on Windows. In the long term, this will need to be figured out when Windows support is added & tested for PyTorch2.
Differential Revision: D56936696
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125472
Approved by: https://github.com/desertfire
This allows `associative_scan` to take an arbitrary pytree of tensors,
which is flattened to their leaves before calling the `associative_scan`
higher order operator.
I also add support in inductor to generate code for scanning over sequences
of tensors.
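A minimal sketch of the flattening step described above, using torch's pytree utilities directly:
```python
import torch
from torch.utils import _pytree as pytree

# An arbitrary pytree of tensors is reduced to its leaves (plus a spec)
# before the higher-order op runs, and can be rebuilt afterwards.
xs = {"a": torch.randn(4), "b": (torch.randn(4), torch.randn(4))}
leaves, spec = pytree.tree_flatten(xs)        # [tensor, tensor, tensor], TreeSpec
rebuilt = pytree.tree_unflatten(leaves, spec)
```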
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122137
Approved by: https://github.com/lezcano, https://github.com/Chillee
ghstack dependencies: #119430
Triton updated the interface for `triton.compile` 5162346487
The `target` argument to compile needs to be wrapped in a `GPUTarget` object. Without proper wrapping, we hit an assert in `compile`. If that assert is removed, Triton attempts to read device info from Torch while inside a torch thread, which hits an in bad fork assert. This change is required for compatibility with latest commits in Triton. The implementation is backwards compatible, so existing versions of Triton that work now continue to work.
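For reference, the wrapping amounts to something along these lines; the exact import path and argument order here are an assumption, not taken from this PR:
```python
# Hypothetical sketch of wrapping the compile target for newer Triton versions.
from triton.backends.compiler import GPUTarget

target = GPUTarget("cuda", 80, 32)  # backend, compute capability, warp size
# triton.compile(..., target=target)
```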
Re-submitting this after https://github.com/pytorch/pytorch/pull/125241 was reverted due to an unrelated CI issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125553
Approved by: https://github.com/huydhn
This diff makes sure that a custom exception is thrown when no valid
choices remain during autotuning. This allows to gracefully fall back
to a default choice, even if that default choice has not been passed to
autotune_select_algorithm.
Additionally, this diff handles RuntimeErrors during autotuning gracefully, e.g. the corresponding choice is ignored but it does not lead to the compilation failure of the entire model if a problematic choice is encountered during autotuning.
( An error is being logged, though).
Test Plan:
CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124928
Approved by: https://github.com/int3
ghstack dependencies: #125406
This PR completely removes the Inductor IR for legacy functional collectives:
- Removed the `CollectiveKernel` hiearchy and `Wait`, as well as the corresponding lowerings. These IRs are target (i.e. Python) specific and don't model node dependencies propoerly (e.g. they rely on `never_reuse_buffers` for correct behavior). They've been superceded by `ir._CollectiveKernel`.
- Removed `InPlaceHint` and the scheduler logic for handling it. `InPlaceHint` is a codegen-time buffer reuse mechanism controlled by the IR's codegen. It's a bit hacky and overlaps with the default buffer reuse mechanism. Removing it since it is only used by legacy functional collectives.
- Removed `OutputBuffer` and `MultiOutputNoSizeAssert` which are designed for and only used by legacy functional collectives.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124992
Approved by: https://github.com/Chillee, https://github.com/wanchaol
Differential Revision: D56347560
More details in this pytorch issue: https://github.com/pytorch/pytorch/issues/124468
It seems there is a race in the ProcessGroupNCCL shutdown logic. The code is quite simple:
```python
for i in range(100):
    dist.all_to_all_single(tensor_out, tensor_in)
dist.destroy_process_group()
```
What can happen is this:
1. dist.destroy_process_group() calls into shutdown() and then calls into abort: b2f6cfd9c0/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (L1095)
2. It'll call ncclCommAbort (not graceful afaict), and also set the ncclAsyncErr_ = ncclSystemError; b2f6cfd9c0/torch/csrc/distributed/c10d/NCCLUtils.hpp (L388).
3. The ncclWatchdog thread may not have woken up while all of this shutdown happens, and in shutdown we're not waiting for the watchdog thread
4. ProcessGroupNCCL dtor is called. It'll wait for the watchdog thread to join
5. The watchdog will check the work's isCompleted(), which then calls checkAndSetException(). Because ncclAsyncErr_ was set to ncclSystemError, it'll error out and make you think it's a NCCL error.
So we can mitigate this issue by checking if the comm was aborted during work.isCompleted/isStarted
Some more longer term discussion in the issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124466
Approved by: https://github.com/shuqiangzhang, https://github.com/yoyoyocmu, https://github.com/kwen2501
Summary:
We remove the assertion for target_func being cat.
The reason is that we have multiple flavors of concat, such as
cat/cat.default/cat_slice/cat_slice_cat/...
The assertion here causes repeated false positives.
Test Plan: Removing assertion code only.
Differential Revision: D56971387
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125540
Approved by: https://github.com/hl475
While there are some similarities, they are also quite different (one
handles NumPy numbers while the other handles ints). I am also going to
add a wrap_symfloat soon, which will have even more different behavior.
So split these out for clarity.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125483
Approved by: https://github.com/lezcano
ghstack dependencies: #125395, #125419
Enable nonzero workspace and Cutlass StreamK for Inductor Cutlass GEMM ops.
This is a simpler rewrite of my original version of #119005 using @peterbell10 's workspace allocation mechanism from #117992
Test Plan:
- Additional unit test in test_cutlass_backend.py which specifically tests StreamK GEMM with workspace requirement
- CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125406
Approved by: https://github.com/jansel
**Context**
We are interested in supporting the case where HSDP reduce-scatters but does not all-reduce in a microbatch backward. This saves communication while still saving memory. Only on the last microbatch do we need to both reduce-scatter and all-reduce. This is not implemented yet and will hopefully come in a future PR.
There is one notable part of doing this. On the last microbatch, we need to perform an accumulation step after reduce-scatter and before all-reduce. If not, then the preceding microbatch's gradients will not be contributed across the replica group. (In other words, we cannot simply accumulate _after_ all-reduce.)
Consider 32 GPUs with 4-way replication and 8-way sharding and 2 microbatches, and focus on global rank 0.
- After the first microbatch, rank 0 will have its shard of $\frac{1}{8} \sum_{i \in S(0)} g_i^{(1)}$, where we define $S(0) = \{0, 1, \dots, 7\}$ to be the ranks in its shard group and we define the $(1)$ superscript to denote the first microbatch.
- Upon the second microbatch, rank 0 after its reduce-scatter will additionally have its shard of $\frac{1}{8} \sum_{i \in S(0)} g_i^{(2)}$. If we only all-reduce this, then this second microbatch's gradients become $\frac{1}{32} \sum_{i=0, 1, \dots, 31} g_i^{(2)}$, so in total, rank 0 has $\frac{1}{8} \sum_{i \in S(0)} g_i^{(1)} + \frac{1}{32} \sum_{i=0, 1, \dots, 31} g_i^{(2)}$, which is wrong.
- Importantly, we must accumulate $\frac{1}{8} \sum_{i \in S(0)} g_i^{(1)} + \frac{1}{8} \sum_{i \in S(0)} g_i^{(2)} = \frac{1}{8}\sum_{i \in S(0)} (g_i^{(1)} + g_i^{(2)})$ first before all-reducing to get $\frac{1}{32} \sum_{i=0, 1, \dots, 31} (g_i^{(1)} + g_i^{(2)})$.
Now, note how under this approach, we want a factor of $\frac{1}{8}$ only (i.e. reciprocal of the shard group size), not $\frac{1}{32}$, for the first microbatch's gradients.
- For bf16/fp32, since we use `ReduceOp.AVG` and we only reduce-scatter on the first microbatch, we correctly have a factor of $\frac{1}{8}$ on the first microbatch.
- For fp16, since we precompute the gradient divide factors at init time assuming always reducing over both shard and replica groups, we incorrectly have a factor of $\frac{1}{32}$ on the first microbatch, deviating from the bf16/fp32 case.
We can address this issue by matching the bf16/fp32 vs. fp16 semantics by computing the divide factors at runtime based on which process groups were passed into the reduction function (`foreach_reduce`).
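A toy numeric check of the ordering argument, tracking only the divide factors (shard group of 8, replica group of 4; the cross-rank sums are left out for brevity):
```python
shard, replica = 8, 4
g1, g2 = 1.0, 1.0  # stand-ins for the two microbatches' gradients

only_all_reduce_second = g1 / shard + (g2 / shard) / replica      # mixes 1/8 and 1/32
accumulate_then_all_reduce = (g1 / shard + g2 / shard) / replica  # uniformly 1/32
print(only_all_reduce_second, accumulate_then_all_reduce)         # 0.15625 0.0625
```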
**Additional Notes**
How to implement the HSDP reduce-scatter but no all-reduce is not entirely clear yet. (What is the cleanest way to do this?) We need to store the partial reduce-scatter output and check for it upon the next backward. We should also be sure to error if the set of parameters receiving gradients changes, in which case we cannot support this easily. Anyway, we will implement this in a follow-up.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125484
Approved by: https://github.com/wanchaol
ghstack dependencies: #125431, #125479
This does a few things that were originally a few PRs but I am on a new machine and don't have ghstack.
If it is too problematic to review, I can re-split, just let me know.
This does:
- Cleanup context manager use in test_flop_counter
- Remove need for mod argument in FlopCounterMode, warning about it
- Re-implement a Module tracker from scratch using global forward Module hooks and multi_grad_hook (we cannot use global backward Module hooks because they don't look for nested Tensors and they're custom-Function based instead of multi_grad_hook based).
- Update FlopCounterMode to use the new ModuleTracker (see the sketch below). The entire existing test suite passes as-is (the only changes there are new tests and the refactoring mentioned above)
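A minimal sketch of querying the new module tracker, assuming it is exposed as `torch.utils.module_tracker.ModuleTracker` (the exact location may differ):
```python
import torch
from torch.utils.module_tracker import ModuleTracker

tracker = ModuleTracker()

class Inner(torch.nn.Module):
    def forward(self, x):
        # `parents` reports which modules are currently executing
        print(tracker.parents)
        return x.relu()

model = torch.nn.Sequential(Inner(), torch.nn.Linear(4, 4))
with tracker:
    model(torch.randn(2, 4))
```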
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125352
Approved by: https://github.com/mikaylagawarecki
You can trigger ciflow tags on main branch commits, so we should be more conservative when checking whether a workflow is for a PR or on the main branch.
get_pr_number checks for the PR number based on the PR_NUMBER env var or a tag of the form `ciflow/<workflow>/<pr number>`.
If we fail to find something like this, then assume it is on the main branch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125485
Approved by: https://github.com/huydhn
A re-land of #124239.
This PR fakifies ScriptObject inputs and attributes in export non-strict mode by default.
The basic idea is to only fakify the script object during tracing (i.e. aot_export). After we get the traced graph module, eagerly executing, serializing, or running more passes will use the real script objects. This is essentially treating the script object as a constant tensor.
Concretely, we:
1. fakify all the script object inputs and module attributes (gathered by constant_attrs),
2. patch the module's attributes with the fakified script objects,
3. right after aot_export, remove the patching (to avoid changing the original module), then modify the exported graph module's attributes to the real script objects.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125490
Approved by: https://github.com/angelayi
Summary:
We found that some dumps are missing when the monitoring thread times out.
This is likely because multiple PGs could still dump the same records
at the same time, so we should allow only PG0 to actually dump.
Test Plan:
unit test
python test/run_test.py --cpp --verbose -i cpp/ProcessGroupNCCLErrorsTest
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125356
Approved by: https://github.com/c-p-i-o
Triton updated the interface for `triton.compile` 5162346487
The `target` argument to compile needs to be wrapped in a `GPUTarget` object. Without proper wrapping, we hit an assert in `compile`. If that assert is removed, Triton attempts to read device info from Torch while inside a torch thread, which hits an in bad fork assert. This change is required for compatibility with latest commits in Triton. The implementation is backwards compatible, so existing versions of Triton that work now continue to work.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125241
Approved by: https://github.com/jansel
This provides utilities for creating and querying properties on
sympy.Symbol. I want to use this refactor to get a better handle on how
the 's' prefix is being used in Inductor. To start, I only do
symbolic_shapes code because that's what I'm familiar with.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125395
Approved by: https://github.com/Skylion007
This PR shows a simple utility to broadcast the parameters across replicas for HSDP:
```python
replicate_group = mesh.get_group("replicate")
for param in model.parameters():
    # E.g. for mesh [[0, 1, 2, 3], [4, 5, 6, 7]] sharding on dim-1 and
    # replicating on dim-0, broadcast with sources 0, 1, 2, 3
    src_rank = dist.get_process_group_ranks(replicate_group)[0]
    torch.distributed.broadcast(
        param.to_local(), src=src_rank, group=replicate_group
    )
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125431
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
Greetings!
Fixes #125403
Please assist me with the testing, as it is possible for my reproducer to miss the error in the code. Several (at least two) threads should enter the same part of the code at the same time to check that the file lock is actually working.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125404
Approved by: https://github.com/ezyang
Fixes cutlass_utils.get_max_alignment(), which so far was not checking the alignment properly. Basically,
the method assumed that the passed layout is contiguous and row-major, which does not have to be true.
Test Plan:
CI - test_cutlass_backend.py to prevent regressions
Added unit test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124930
Approved by: https://github.com/int3
ghstack dependencies: #124929
- Implement a very straightforward Metal copy of CPU int4mm kernel
- Implement int8mm kernel by constructing a graph consisting of upcast, transpose and mm
- Add `isCapturing`, `isCaptureEnabled`, `startCapture` and `stopCapture` methods to `MPSProfile` which can be used to help one debug/profile Metal kernels by wrapping the calls with the following
```cpp
if (getMPSProfiler().isCaptureEnabled()) {
getMPSProfiler().startCapture(__func__, mpsStream);
}
...
if (getMPSProfiler().isCapturing()) {
getMPSProfiler().stopCapture(mpsStream);
}
```
that, if invoked with the `MTL_CAPTURE_ENABLED` environment variable set to one, will produce .gputrace files in the current working directory, which can later be loaded and used to debug or profile the kernel
<img width="1093" alt="image" src="https://github.com/pytorch/pytorch/assets/2453524/a2bf27e8-df8a-442c-a525-1df67b8a376a">
- Added `test_int4mm` to TestLinalgMPS, which is mostly copy-n-paste of the test from `test_linalg`
TODOs:
- Add weight pack
- Perf-tune both kernels
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125163
Approved by: https://github.com/mikekgfb
Some toy example:
<img width="998" alt="Screenshot 2024-04-17 at 2 00 05 PM" src="https://github.com/pytorch/pytorch/assets/31054793/b5665a63-beb0-4ca1-92c6-c57a052812fd">
We define `FullyShardedDataParallel._unshard(async_op: bool = False)` that can be used to prefetch all-gathers. The user should make sure:
1. Run lazy init before the first `_unshard` call of training. For example, this can hackily be done via `root_module.check_is_root()` on the root FSDP module `root_module`.
2. Call `root_module._wait_unshard_streams_on_current_stream()` before the first `_unshard` call of the current iteration (just need to call it once after last optimizer step and before first `_unshard` of this iteration).
Differential Revision: [D56262876](https://our.internmc.facebook.com/intern/diff/D56262876)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124304
Approved by: https://github.com/wanchaol
- Original `test_grad_scaling_autocast_fused_optimizers` does not work since there is no "fused" in `optim_inputs`
- We should use different `grad_scaler`s; they should not share one `scale`. No issue is exposed here because the default `_growth_interval` is 2000, so the scale will not grow, and no inf is found, so it will not be reduced. The one in `test_cuda.py` should also have this issue.
- I set a manual seed for reproducibility in case there is any numerical failure.
- I use a tensor tracker here because we failed this UT in the dynamo case; the generated C++ code is not exactly the same for the fused and non-fused kernels.
- I make it check both `cuda` and `cpu`.
- I found an SGD numerical issue with `clang` and fixed it by using `fmadd` instead of `add/mul` in the fused SGD vec kernel.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124904
Approved by: https://github.com/jgong5, https://github.com/janeyx99
Add `PyTorchFileWriter.write_record_metadata(record_name, num_bytes)` that
- writes the zipfile header/end of central directory metadata for an entry*
- reserves `num_bytes` in the zipfile for the payload.
*Since the payload is not provided, the CRC32 computation is skipped and 0s are written in the corresponding entry of the zipfile header
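A minimal sketch of the new method, assuming the writer is reached via `torch._C.PyTorchFileWriter` (the file and record names here are made up):
```python
import torch

writer = torch._C.PyTorchFileWriter("reserved.zip")
# Reserve 1 KiB for a payload to be filled in later; since no payload is
# provided, the CRC32 field for this entry is left as 0.
writer.write_record_metadata("data/0", 1024)
writer.write_end_of_file()
```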
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125184
Approved by: https://github.com/albanD
Summary:
Fixing the implementation of `_flatten_dynamic_shapes()` to follow how `_process_dynamic_shapes()` does it. The previous implementation would misinterpret some nested dynamic shapes specs, causing it to miss some shape specs, for example with nested inputs/constant input tuples:
```python
inputs = (
    (2, 1),
    (
        torch.randn(2, 1),
        torch.randn(2, 2),
        torch.randn(2, 3),
    )
)
dynamic_shapes = (
    (None, None),
    (
        None,
        None,
        None,
    )
)
```
This would get interpreted as 2 shape specs for the 2d and 3d tensors. Fixing so this doesn't happen.
Test Plan: Existing export tests
Differential Revision: D56894923
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125415
Approved by: https://github.com/angelayi
Summary: Remove the check to make sure all GPU labels are enumerated when CUDA is available. There are some systems where CUDA is available but we do not print any GPU labels (because GPU is not available).
Test Plan: Test in regression with ciflow/periodic label
Differential Revision: D56906893
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125430
Approved by: https://github.com/izaitsevfb
Summary:
The chunk/split ops on the weights/constants are folded in an FX pass, and each output tensor has the same storage size as the original tensor (which is 3x its actual size in the case of chunk(3)). However, the backend calculates the memory size on device from the tensor shape/stride/dtype. This causes a mismatch when copying weights/constants to the device, as the allocated memory on device is always smaller than the size of the weights/constants, and results in a runtime error when loading the weight/constant (T172125529).
This diff fixes the issue by cloning the tensors after const folding so that the tensors has correct storage size.
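The storage-size mismatch is easy to see in isolation; cloning is what gives each chunk a right-sized storage:
```python
import torch

w = torch.randn(48, 64)                      # float32, 48 * 64 * 4 = 12288 bytes
a, b, c = w.chunk(3)                         # views sharing the full storage
print(a.untyped_storage().nbytes())          # 12288: 3x the chunk's actual data
print(a.clone().untyped_storage().nbytes())  # 4096: just the chunk itself
```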
Test Plan:
Before this change: (18432 = 48 * 64 * 2 * 3)
```
RuntimeError: Failed to load constant getitem_idx0 split (remaining=18432) at fbcode/caffe2/torch/fb/acc_runtime/afg/afg_bindings.cpp:3422: Request failed because an invalid parameter
```
```
buck2 run mode/opt //caffe2/torch/fb/acc_runtime/afg/tests:test_operators-artemis -- -r test_mem_size_mismatch
```
```
Ran 1 test in 7.048s
OK
```
Reviewed By: jfix71
Differential Revision: D56663931
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125199
Approved by: https://github.com/jfix71
`cumsum` and `cumprod` were (are?) buggy for MPS: c8d2a55273/aten/src/ATen/native/mps/operations/UnaryOps.mm (L435-L436)
A workaround casts the input to int32 prior to performing the op to prevent overflow for certain numeric types.
It turns out this issue also affects boolean types:
```python
import torch
print(torch.ones(128, dtype=torch.bool, device="mps").cumsum(0)[-1])
# tensor(-128, device='mps:0')
```
In this PR I'm adding logic to also cast bool dtypes to int32 prior to `cumsum` and `cumprod`, although output is guaranteed not to overflow for the latter with bools. I'm also adding a test to prevent regressions.
Fixes #96614, #106112, #109166
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125318
Approved by: https://github.com/malfet
`torch.utils.benchmark.Compare` is not directly exposed in the torch.utils.benchmark documentation.
I think this is a valuable resource to add, since it can help people embrace the torch benchmark way of doing things and help people build documentation around it.
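For context, a small self-contained example of the utility being documented:
```python
import torch
from torch.utils import benchmark

results = []
for n in (128, 1024):
    timer = benchmark.Timer(
        stmt="x @ x",
        globals={"x": torch.randn(n, n)},
        label="matmul",
        sub_label=f"{n}x{n}",
        description="float32",
    )
    results.append(timer.blocked_autorange(min_run_time=0.2))

benchmark.Compare(results).print()  # renders a grouped comparison table
```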
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125009
Approved by: https://github.com/mikaylagawarecki
~Users may have custom use cases for the `strict` parameter in load. In my mind, if we automatically call `state_dict` and `load_state_dict` in save/load, we need to support the same functionality in `nn.Modules`.~
It turns out this is actually not related to nn.Module's strict param. Since `state_dict` is called inside `dcp.load`, it's actually impossible to create a model such that the following would raise an error:
```
state_dict = module.state_dict()
module.load_state_dict(state_dict, strict=True)
```
The issue is actually just when there are elements in `state_dict` which do not exist in the checkpoint. This PR adds the ability to configure this behavior through the DefaultLoadPlanner (see tests).
Concretely, if module has extra attributes not present in the checkpoint, we will only raise an error if `DefaultLoadPlanner.allow_partial_load==False`
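A rough sketch of the knob described above, assuming `allow_partial_load` is exposed as a constructor argument on DefaultLoadPlanner (the checkpoint path is made up):
```python
import torch
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.default_planner import DefaultLoadPlanner

model = torch.nn.Linear(4, 4)
dcp.save(model.state_dict(), checkpoint_id="/tmp/ckpt")

# With allow_partial_load=True, state_dict keys that are missing from the
# checkpoint are tolerated instead of raising an error.
state_dict = model.state_dict()
dcp.load(
    state_dict,
    checkpoint_id="/tmp/ckpt",
    planner=DefaultLoadPlanner(allow_partial_load=True),
)
```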
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123869
Approved by: https://github.com/fegin
Summary:
## `-Wmissing-prototypes`
In ATen-Vulkan, we often define functions in `.cpp` files without declaring them in `.h` files or hiding them in an anonymous namespace.
Example: [`Packing.cpp`'s channel_image_repacking()](f1f142c44f/aten/src/ATen/native/vulkan/impl/Packing.cpp (L299-L348))
On Mac, this results in a `-Wmissing-prototypes` warning, which is disabled in this change.
## `-Wshadow`
In `Adapter.cpp`, we shadow a variable called `properties`, which we fix in this change as opposed to disabling the warning.
Test Plan: CI
Differential Revision: D56850324
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125361
Approved by: https://github.com/SS-JIA
From my test with an Ads production workload, I found that sometimes kernel_file is None and grid is a tuple. It will crash since ExecutionTraceObserver expects strings for both kernel_file and grid. This PR makes sure kernel_file and grid are always passed down as strings. We still need to find the root cause of why kernel_file is None.
Unit test:
buck test @mode/dev-nosan caffe2/test:profiler -- test_execution_trace_with_pt2
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125384
Approved by: https://github.com/davidberard98, https://github.com/sraikund16
Summary:
Torchscript modules do not support forward hooks and thus can't work with the flop_counter context manager's hierarchical output (enabled by passing a module to FlopCounterMode on construction).
Currently any module that includes a script module causes an exception to be thrown, so this adds a try/except to skip script modules when registering forward hooks.
Test Plan: CI Signals
Differential Revision: D56850661
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125346
Approved by: https://github.com/842974287
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
* #124944
* #124939
* __->__ #122965
Differential Revision: [D55493240](https://our.internmc.facebook.com/intern/diff/D55493240/)
*This PR is now ready for merge and is not an RFC*
Major choices are:
- the introduction of the AsyncStager protocol
- removing `executor` from the params
- leaving async as a separate method (for now)
This proposal seeks to add extension points to dcp.async_save, allowing users to:
- Specify a specific staging method when calling async_save
- Allow a vehicle for also making the staging method async, to allow for cases where we may want to overlap with the training loop (e.g., overlap the d2h copy and only synchronize at the optim.step)
- Potentially specify the execution method for doing async_save in parallel. For example some users may prefer a subprocess over a thread to avoid GIL issues.
A totally reasonable alternative to this entire proposal is to expect users who want this level of customization
to write their own custom async save methods. Here's an example which addresses the issues mentioned
in PR comments.
```python
def custom_async_save(...):
    # this step accomplishes staging and includes the usual 'planning' calls (issue 1)
    buffered_writer = CpuBufferedWriter()  # this is stateful, contains a copy of state_dict
    dcp.save(state_dict, storage_writer=buffered_writer)
    final_storage_writer = FileSystemWriter()
    mp.spawn(  # issue2 is gone, do whatever you want here
        dcp.save,  # or some custom sub-process method which calls dcp.save under the hood
        buffered_writer.state_dict,  # lot's of way's to do this, not really the most important part
        checkpoint_id=checkpoint_id,
        storage_writer=storage_writer,
        planner=planner,
        process_group=process_group,  # this actually wouldn't work, but again not the pt.
    )
    # leaving out the rest of the details for managing your extra special subprocess.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122965
Approved by: https://github.com/daulet-askarov
A common complaint when working with data-dependent code in PyTorch is that it's hard to tell how far you are from the finish line: every time a GuardOnDataDependentSymNode error is hit, you have to somehow fix or workaround it to see the next one.
This PR adds a new mode `torch._functorch.config.fake_tensor_propagate_real_tensors` which modifies fake tensors to also propagate real tensors. This means that when we try to guard on a data-dependent SymNode, we can actually produce a real result. We also produce a warning which you should consult to figure out what the crux points are.
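Turning the mode on is just a config flip:
```python
import torch._functorch.config as functorch_config

# Fake tensors will also carry real values, so data-dependent guards are
# evaluated (with a warning logged) instead of raising GuardOnDataDependentSymNode.
functorch_config.fake_tensor_propagate_real_tensors = True
```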
I ran this on vision_maskrcnn. In the baseline (without this mode), the model has 27 graph breaks, resulting in 40 graphs. With this mode on, the model has only 11 graph breaks, resulting in 15 graphs (the remaining graph breaks are due to missing functionality for item() on float tensor and some other Dynamo missing features.) You get a list of things that would have errored like this:
```
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u1) < 2) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u1), 1)) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u0) < 2) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u0), 1)) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u1) < 2) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u1), 1)) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u0) < 2) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u0), 1)) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u1) < 2) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u1), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u0) < 2) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> False
```
Potential later follow ups:
* Improve the warning messages (in particular, should provide user frames)
* GC real tensors when they are no longer needed by tracing. Right now, this will use A LOT of memory, as much as if your GC were broken and every intermediate tensor were kept live
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125115
Approved by: https://github.com/IvanKobzarev
Minor refactoring:
Remove unused "fused epilogue node" arguments from some method Kernel call signatures.
Test Plan:
Covered by current tests in test_cutlass_backend.py - no functional change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124929
Approved by: https://github.com/eellison
Fixes #121965
This PR hopes to add support for complex numbers in the scatter/gather related kernels. For brevity, I will only include `complex<float>` for now, as `complex<double>`, for example, will be more complicated.
C++ unit tests are currently passing alongside tests in `test_scatter_gather_ops.py`. Python test suites also seem to be passing.
Please keep the following in mind:
1) I think this is my first time using PyTorch.
2) This is my first contribution to PyTorch.
Environment:
3080 & WSL 2. `nvcc` is at 12.4.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124809
Approved by: https://github.com/mikaylagawarecki
In #123319, we guard some behavior behind the `assume_aligned_inputs` config option. If we set this to `False`, then the behavior added in #123319 becomes the default behavior. See the referenced PR for more details about the behavior affected.
Side effects:
* It's possible that this will hurt performance in some scenarios. For example, if an unaligned input is used in a matmul, it might be better to perform the clone to align it first.
* This will occasionally cause recompiles. Specifically: the check we perform (`(storage_offset * get_dtype_size(dtype)) % ALIGNMENT == 0`) can be guarded on if the storage_offset becomes dynamic. storage_offset becomes dynamic during automatic_dynamic_shapes after a shape or stride changes. Previously, this was increasing graph breaks in cpu inductor torchbench tests (but is fixed by more carefully guarding checks on alignment, so that we don't run them and generate guards unless actually needed).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124336
Approved by: https://github.com/eellison
* Fixes https://github.com/pytorch/pytorch/issues/124886
* Kind of similar to https://github.com/pytorch/pytorch/pull/109393
I think what happens is `exit` and `exit /b` propagate the errorlevel correctly, but `exit /b` only exits the currently running batch script and not the entire cmd.exe (or whatever program is running the batch script), so `exit /b` exits with errorlevel 1, but the parent cmd exits with 0, and bash sees cmd's 0
I think `goto fail` and `exit` are the same thing when the batch script is run from a bash script so either would work in this case? But the `goto fail` method might be better if someone happens to run the script on cmdline
I assumed that anywhere anyone was exiting after checking the error code, they did want to exit completely, and I'm pretty sure that being inside a parenthesis counts as being a different script, so I changed everything to goto fail just in case, this might be too aggressive?
Logs after this change for a build failure on cuda:
https://github.com/pytorch/pytorch/actions/runs/8912185834/job/24475087535?pr=125306
```
2 errors detected in the compilation of "C:/actions-runner/_work/pytorch/pytorch/aten/src/ATen/native/cuda/AdaptiveMaxPooling3d.cu".
AdaptiveMaxPooling3d.cu
[7599/8420] Linking CXX shared library bin\torch_cpu.dll
ninja: build stopped: subcommand failed.
-- Building version 2.4.0a0+git3171c11
cmake -GNinja -DBUILD_ENVIRONMENT=win-vs2019-cuda11.8-py3 -DBUILD_PYTHON=True -DBUILD_TEST=True -DBUILD_TYPE=release -DBUILD_WHEEL=1 -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_COMPILER=C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.8/bin/nvcc.exe -DCMAKE_CUDA_COMPILER_LAUNCHER=C:/actions-runner/_work/pytorch/pytorch/build/win_tmp/bin/randomtemp.exe;C:/actions-runner/_work/pytorch/pytorch/build/win_tmp\bin\sccache.exe -DCMAKE_CXX_COMPILER_LAUNCHER=sccache -DCMAKE_C_COMPILER_LAUNCHER=sccache -DCMAKE_GENERATOR=Ninja -DCMAKE_INSTALL_PREFIX=C:\actions-runner\_work\pytorch\pytorch\torch -DCMAKE_PREFIX_PATH=C:\Jenkins\Miniconda3\Lib\site-packages -DCUDA_NVCC_EXECUTABLE=C:/actions-runner/_work/pytorch/pytorch/build/win_tmp/bin/nvcc.bat -DCUDNN_LIBRARY=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\lib\x64 -DNUMPY_INCLUDE_DIR=C:\Jenkins\Miniconda3\lib\site-packages\numpy\core\include -DPYTHON_EXECUTABLE=C:\Jenkins\Miniconda3\python.exe -DPYTHON_INCLUDE_DIR=C:\Jenkins\Miniconda3\Include -DPYTHON_LIBRARY=C:\Jenkins\Miniconda3/libs/python39.lib -DTORCH_BUILD_VERSION=2.4.0a0+git3171c11 -DTORCH_CUDA_ARCH_LIST=8.6 -DUSE_CUDA=1 -DUSE_NUMPY=True C:\actions-runner\_work\pytorch\pytorch
cmake --build . --target install --config Release -- -j 8
(base) C:\actions-runner\_work\pytorch\pytorch>if errorlevel 1 goto fail
(base) C:\actions-runner\_work\pytorch\pytorch>exit /b 1
Error: Process completed with exit code 1.
```
vs original
https://github.com/pytorch/pytorch/actions/runs/8910674030/job/24470387612
```
2 errors detected in the compilation of "C:/actions-runner/_work/pytorch/pytorch/aten/src/ATen/native/cuda/AdaptiveMaxPooling3d.cu".
AdaptiveMaxPooling3d.cu
[7604/8420] Linking CXX shared library bin\torch_cpu.dll
ninja: build stopped: subcommand failed.
-- Building version 2.4.0a0+gite09f98c
cmake -GNinja -DBUILD_ENVIRONMENT=win-vs2019-cuda11.8-py3 -DBUILD_PYTHON=True -DBUILD_TEST=True -DBUILD_TYPE=release -DBUILD_WHEEL=1 -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_COMPILER=C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.8/bin/nvcc.exe -DCMAKE_CUDA_COMPILER_LAUNCHER=C:/actions-runner/_work/pytorch/pytorch/build/win_tmp/bin/randomtemp.exe;C:/actions-runner/_work/pytorch/pytorch/build/win_tmp\bin\sccache.exe -DCMAKE_CXX_COMPILER_LAUNCHER=sccache -DCMAKE_C_COMPILER_LAUNCHER=sccache -DCMAKE_GENERATOR=Ninja -DCMAKE_INSTALL_PREFIX=C:\actions-runner\_work\pytorch\pytorch\torch -DCMAKE_PREFIX_PATH=C:\Jenkins\Miniconda3\Lib\site-packages -DCUDA_NVCC_EXECUTABLE=C:/actions-runner/_work/pytorch/pytorch/build/win_tmp/bin/nvcc.bat -DCUDNN_LIBRARY=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\lib\x64 -DNUMPY_INCLUDE_DIR=C:\Jenkins\Miniconda3\lib\site-packages\numpy\core\include -DPYTHON_EXECUTABLE=C:\Jenkins\Miniconda3\python.exe -DPYTHON_INCLUDE_DIR=C:\Jenkins\Miniconda3\Include -DPYTHON_LIBRARY=C:\Jenkins\Miniconda3/libs/python39.lib -DTORCH_BUILD_VERSION=2.4.0a0+gite09f98c -DTORCH_CUDA_ARCH_LIST=8.6 -DUSE_CUDA=1 -DUSE_NUMPY=True C:\actions-runner\_work\pytorch\pytorch
cmake --build . --target install --config Release -- -j 8
(base) C:\actions-runner\_work\pytorch\pytorch>if errorlevel 1 exit /b
+ assert_git_not_dirty
+ [[ win-vs2019-cuda11.8-py3 != *rocm* ]]
+ [[ win-vs2019-cuda11.8-py3 != *xla* ]]
++ git status --porcelain
++ grep -v '?? third_party'
++ true
+ git_status=
+ [[ -n '' ]]
+ echo 'BUILD PASSED'
BUILD PASSED
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125306
Approved by: https://github.com/ZainRizvi, https://github.com/huydhn, https://github.com/atalman
To fix data-dependent errors we want to recommend that people use `torch._check*` APIs. The `constrain_as*` APIs should be fully subsumed by them, and in the future we should kill them entirely.
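A small illustration of the recommended direction (the concrete bound here is made up):
```python
import torch

def f(x):
    u = x.item()
    # These replace the constrain_as_size / constrain_as_value style calls.
    torch._check_is_size(u)
    torch._check(u <= 128)
    return torch.zeros(u)

f(torch.tensor(5))
```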
Differential Revision: D56774333
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125253
Approved by: https://github.com/ezyang
In the given test case, we have a ModuleList of 3 modules (`norm.0`, `norm.1`, `norm.2`) which share the same `weight` and `bias` tensors. However, when we trace, they all end up pointing to one state dict name (e.g. `norm.2`).
```
graph():
%p_norms_0_weight : [num_users=0] = placeholder[target=p_norms_0_weight]
%p_norms_0_bias : [num_users=0] = placeholder[target=p_norms_0_bias]
%p_norms_1_weight : [num_users=0] = placeholder[target=p_norms_1_weight]
%p_norms_1_bias : [num_users=0] = placeholder[target=p_norms_1_bias]
%p_norms_2_weight : [num_users=3] = placeholder[target=p_norms_2_weight]
%p_norms_2_bias : [num_users=3] = placeholder[target=p_norms_2_bias]
%input_ : [num_users=1] = placeholder[target=input_]
%native_layer_norm : [num_users=1] = call_function[target=torch.ops.aten.native_layer_norm.default](args = (%input_, [2, 2, 3], %p_norms_2_weight, %p_norms_2_bias, 1e-05), kwargs = {})
%getitem : [num_users=1] = call_function[target=operator.getitem](args = (%native_layer_norm, 0), kwargs = {})
%native_layer_norm_1 : [num_users=1] = call_function[target=torch.ops.aten.native_layer_norm.default](args = (%getitem, [2, 2, 3], %p_norms_2_weight, %p_norms_2_bias, 1e-05), kwargs = {})
%getitem_3 : [num_users=1] = call_function[target=operator.getitem](args = (%native_layer_norm_1, 0), kwargs = {})
%native_layer_norm_2 : [num_users=1] = call_function[target=torch.ops.aten.native_layer_norm.default](args = (%getitem_3, [2, 2, 3], %p_norms_2_weight, %p_norms_2_bias, 1e-05), kwargs = {})
%getitem_6 : [num_users=1] = call_function[target=operator.getitem](args = (%native_layer_norm_2, 0), kwargs = {})
return (getitem_6,)
```
This causes an error in the unflattener where after constructing the submodules for `norm.0`, it will have the graph pointing to `norm.2.weight` and `norm.2.bias`:
```
graph():
%p_norms_2_bias : [num_users=1] = placeholder[target=p_norms_2_bias]
%p_norms_2_weight : [num_users=1] = placeholder[target=p_norms_2_weight]
%input_ : [num_users=1] = placeholder[target=input_]
%native_layer_norm : [num_users=1] = call_function[target=torch.ops.aten.native_layer_norm.default](args = (%input_, [2, 2, 3], %p_norms_2_weight, %p_norms_2_bias, 1e-05), kwargs = {})
%getitem : [num_users=1] = call_function[target=operator.getitem](args = (%native_layer_norm, 0), kwargs = {})
return getitem
```
Since the attributes are not within the same scope of the graph (`norm.0` vs. `norm.2`), they will not be added to the subgraph, causing an error.
So this PR handles the duplicate state dict attributes by modifying the `inputs_to_state` dict to map from node names to a list of possible state dict target names.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125192
Approved by: https://github.com/zhxchen17
The first append does not have a leading space, which incorrectly merges it into any previous argument, like `-allow-unsupported-compiler` in my case, resulting in a silly error: `unrecognized command-line option '-allow-unsupported-compiler-DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS'`
full log:
```
python setup.py develop
Building wheel torch-2.4.0a0+git75fa54a
-- Building version 2.4.0a0+git75fa54a
cmake3 -GNinja -DBUILD_PYTHON=True -DBUILD_TEST=True -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/code/pytorch/torch -DCMAKE_PREFIX_PATH=/code/pytorch/.venv/lib/python3.12/site-packages;/code/spack/opt/spack/linux-fedora40-zen2/gcc-14.0.1/gcc-13.2.0-noa2f4oqalxzqvsebhuntndewgt4gq4h:/code/spack/opt/spack/linux-fedora40-zen2/gcc-14.0.1/zstd-1.5.6-z3guwm4l5rmmsv4g4wvkej3ri3bppeja:/code/spack/opt/spack/linux-fedora40-zen2/gcc-14.0.1/zlib-ng-2.1.6-kwi4ljobodjgv5eetnga4bow6crdlacl:/code/spack/opt/spack/linux-fedora40-zen2/gcc-14.0.1/mpc-1.3.1-nuwa2snyzm265lsupa2dkmxxyhiqcv7e:/code/spack/opt/spack/linux-fedora40-zen2/gcc-14.0.1/mpfr-4.2.1-wepuwobwttxbtz3nguimxa2mlljjozsi:/code/spack/opt/spack/linux-fedora40-zen2/gcc-14.0.1/gmp-6.2.1-ashy6kiitonxv2f365f4q3beggzf3646:/code/spack/opt/spack/linux-fedora40-zen2/gcc-14.0.1/gcc-runtime-14.0.1-wmogkqrzn7t57dogaake2hmhjbod27gs -DNUMPY_INCLUDE_DIR=/code/pytorch/.venv/lib64/python3.12/site-packages/numpy/core/include -DPYTHON_EXECUTABLE=/code/pytorch/.venv/bin/python -DPYTHON_INCLUDE_DIR=/usr/include/python3.12 -DPYTHON_LIBRARY=/usr/lib64/libpython3.12.so.1.0 -DTORCH_BUILD_VERSION=2.4.0a0+git75fa54a -DUSE_NUMPY=True /code/pytorch
-- /usr/lib64/ccache/c++ /code/pytorch/torch/abi-check.cpp -o /code/pytorch/build/abi-check
-- Determined _GLIBCXX_USE_CXX11_ABI=1
-- Current compiler supports avx2 extension. Will build perfkernels.
-- Current compiler supports avx512f extension. Will build fbgemm.
-- The CUDA compiler identification is NVIDIA 12.4.131
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - failed
-- Check for working CUDA compiler: /usr/local/cuda-12/bin/nvcc
-- Check for working CUDA compiler: /usr/local/cuda-12/bin/nvcc - broken
CMake Error at /usr/share/cmake/Modules/CMakeTestCUDACompiler.cmake:59 (message):
The CUDA compiler
"/usr/local/cuda-12/bin/nvcc"
is not able to compile a simple test program.
It fails with the following output:
Change Dir: '/code/pytorch/build/CMakeFiles/CMakeScratch/TryCompile-mSGoFl'
Run Build Command(s): /code/pytorch/.venv/bin/ninja -v cmTC_ee207
[1/2] /usr/local/cuda-12/bin/nvcc -forward-unknown-to-host-compiler -allow-unsupported-compiler-DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS -D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all "--generate-code=arch=compute_52,code=[compute_52,sm_52]" -MD -MT CMakeFiles/cmTC_ee207.dir/main.cu.o -MF CMakeFiles/cmTC_ee207.dir/main.cu.o.d -x cu -c /code/pytorch/build/CMakeFiles/CMakeScratch/TryCompile-mSGoFl/main.cu -o CMakeFiles/cmTC_ee207.dir/main.cu.o
FAILED: CMakeFiles/cmTC_ee207.dir/main.cu.o
/usr/local/cuda-12/bin/nvcc -forward-unknown-to-host-compiler -allow-unsupported-compiler-DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS -D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all "--generate-code=arch=compute_52,code=[compute_52,sm_52]" -MD -MT CMakeFiles/cmTC_ee207.dir/main.cu.o -MF CMakeFiles/cmTC_ee207.dir/main.cu.o.d -x cu -c /code/pytorch/build/CMakeFiles/CMakeScratch/TryCompile-mSGoFl/main.cu -o CMakeFiles/cmTC_ee207.dir/main.cu.o
gcc: error: unrecognized command-line option '-allow-unsupported-compiler-DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS'
ninja: build stopped: subcommand failed.
CMake will not be able to correctly generate this project.
Call Stack (most recent call first):
cmake/public/cuda.cmake:47 (enable_language)
cmake/Dependencies.cmake:44 (include)
CMakeLists.txt:758 (include)
-- Configuring incomplete, errors occurred!
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125294
Approved by: https://github.com/albanD
We encountered some model accuracy failures, as the tolerance is critical. In general, we align with CUDA practice. This PR adjusts the tolerance for Torchbench models in training mode on Intel GPU devices to align with CUDA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125213
Approved by: https://github.com/desertfire
Summary: In the recent weeks, we have encountered bugs in both the normal synchronous trace and on-demand tracing. This diff on its own does sanity checking to make sure the profiler does not have spans that extend past the boundaries that we expect. It also checks some basic properties of the tracings we expect to see. Right now the sanity tests check some basic properties to make sure that the tracings are not completely broken. Requests/suggestions for other properties are welcome.
Test Plan: Run the tests in OSS and Buck
Reviewed By: aaronenyeshi
Differential Revision: D56374298
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124773
Approved by: https://github.com/aaronenyeshi
Fixes #121965
This PR hopes to add support for complex numbers in the scatter/gather related kernels. For brevity, I will only include `complex<float>` for now, as `complex<double>`, for example, will be more complicated.
C++ unit tests are currently passing alongside tests in `test_scatter_gather_ops.py`. Python test suites also seem to be passing.
Please keep the following in mind:
1) I think this is my first time using PyTorch.
2) This is my first contribution to PyTorch.
Environment:
3080 & WSL 2. `nvcc` is at 12.4.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124809
Approved by: https://github.com/eqy, https://github.com/mikaylagawarecki
As apparently `vshlq_u32` is faster than `vcvt_f32_f16`.
Refactor the NEON `tinygemm_kernel` to rely on `load_as_float32x4` and `load_as_float32x4x2` and implement them for float16 (using vcvt), bfloat16 (using left shift) and plain float32 (not using anything).
As a result, stories110M runs at 60 tokens/sec with f16, but at 66 tokens/sec with bf16 and 75 tokens/sec with f32, though higher bandwidth demand starts to favor reduced floating-point types as the model size gets bigger.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125290
Approved by: https://github.com/mikekgfb
Fix the patch failure: we should patch the function where it is used, not where it is defined.
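A generic, self-contained illustration of that rule (not the actual timer code):
```python
import math
from unittest import mock

def area(r):
    return math.pi * r * r

# area() looks pi up through the math module at call time, so patching
# "math.pi" works here. Had this module done `from math import pi`, we would
# need to patch "<this_module>.pi" instead; that is, patch where the name is used.
with mock.patch("math.pi", 3.0):
    assert area(2) == 12.0
```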
Failure info:
```bash
root@cambricon-PowerEdge-C4140:/workspace# python file_based_timer_test.py -k test_expired_timers
/opt/conda/lib/python3.10/site-packages/torch/_custom_ops.py:253: DeprecationWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
return torch.library.impl_abstract(qualname, func, _stacklevel=2)
E
======================================================================
ERROR: test_expired_timers (__main__.FileTimerServerTest)
tests that a single expired timer on a process should terminate
----------------------------------------------------------------------
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2757, in wrapper
method(*args, **kwargs)
File "/opt/conda/lib/python3.10/unittest/mock.py", line 1376, in patched
with self.decoration_helper(patched,
File "/opt/conda/lib/python3.10/contextlib.py", line 135, in __enter__
return next(self.gen)
File "/opt/conda/lib/python3.10/unittest/mock.py", line 1358, in decoration_helper
arg = exit_stack.enter_context(patching)
File "/opt/conda/lib/python3.10/contextlib.py", line 492, in enter_context
result = _cm_type.__enter__(cm)
File "/opt/conda/lib/python3.10/unittest/mock.py", line 1447, in __enter__
original, local = self.get_original()
File "/opt/conda/lib/python3.10/unittest/mock.py", line 1420, in get_original
raise AttributeError(
AttributeError: <module 'torch.distributed.elastic.timer' from '/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/timer/__init__.py'> does not have the attribute 'log_debug_info_for_expired_timers'
To execute this test, run the following from the base repo dir:
python file_based_timer_test.py -k test_expired_timers
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
----------------------------------------------------------------------
Ran 1 test in 0.792s
FAILED (errors=1)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125144
Approved by: https://github.com/gag1jain
On macOS 14.4, the system Python is configured to point to a non-existent include dir
```
% /usr/bin/python3 -c "import sysconfig;print(sysconfig.get_path('include'))"
/Library/Python/3.9/include
```
Work around the issue by composing the path to the include folder from the `stdlib` config, which points to
```
% /usr/bin/python3 -c "import sysconfig;print(sysconfig.get_path('stdlib'))"
/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125285
Approved by: https://github.com/kit1980
While studying some tlparse, I noticed that CompilationMetrics was reporting that there was no error for frames that have no nodes. I'm pretty sure we don't actually install a frame in this situation. has_guarded_code will tell us if that's the case, because it says if the GuardedCode object is None or not.
Actually, while working on this, I was wondering if we can ever trigger the "skip this frame entirely, do not trace it ever again" codepath; as best as I could tell, it's impossible for this to happen by the time we get to the compilation metrics block.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125279
Approved by: https://github.com/yanboliang
Summary: There's a shortcoming in the FX graph cache tests in that they don't fully clear all inductor in-memory caches when testing the cache-hit path: We were previously accessing the FX graph cache correctly, but when loading the source object using the PyCodeCache.load_by_key_path() method, _that_ path was serving entries out of memory. To better mimic what happens during warm start (i.e., a new process), we should clear all in-memory caches.
Test Plan: updated the unit tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125280
Approved by: https://github.com/eellison
1. This PR removes the logic for saving and removing the pre-backward hook handles (which is registered via `register_multi_grad_hook(mode="any")`).
2. This PR removes the logic for _trying_ to guard against mistargeted prefetches that relies on querying if the engine will execute the module output tensors' `grad_fn`s. (See https://github.com/pytorch/pytorch/pull/118118 for original motivation.)
For 1, the logic was error prone since it relied on `set_is_last_backward(False)` being set correctly or else pre-backward hooks could be de-registered too early. We would prefer to match the hook lifetimes with that of the autograd graph. This solves a bug with a 1f1b interleaved schedule.
If we directly remove the manual saving/removing hook handle logic, then we have a ref cycle where the tensors' `grad_fn`s are passed to the hook function. We decide to simply remove this `grad_fn` logic since (1) it cannot perfectly prevent mistargeted prefetches and (2) it introduces undesired complexity. In the future, we may prefer a different mechanism to override the prefetching for more complex/dynamic use cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125269
Approved by: https://github.com/weifengpy
ghstack dependencies: #125190, #125191
Summary: Discovered breakages by enabling codecache by default and doing a CI run. I'll commit these fixes first and eventually enabling caching by default will (hopefully) be a one-liner.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125258
Approved by: https://github.com/eellison
Fix https://github.com/pytorch/pytorch/issues/124900.
When we reconstruct `ContextWrappingVariables`s, we only reconstruct the context class, not the object. Normally, contexts are active (via `with ctx:`) and we initialize the context object in the resume function. But for the case of inactive contexts (contexts declared ahead of time before the `with` block), we do not reconstruct them properly in the optimized bytecode or resume function. So this PR adds initialization for inactive contexts in the resume function.
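For reference, the "inactive context" pattern looks like this (a minimal sketch, not the exact repro from the issue):
```python
import torch

ctx = torch.no_grad()            # declared ahead of time, i.e. inactive

@torch.compile(backend="eager")
def f(x):
    with ctx:                    # only entered inside the compiled function
        return x + 1

f(torch.randn(3))
```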
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125203
Approved by: https://github.com/jansel
Summary: When CapabilityBasedPartitioner creates the fused subgraph as the call_module node, it didn't populate the node.meta["val"] field.
Test Plan: OSS CI
Differential Revision: D56789259
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125261
Approved by: https://github.com/zhxchen17
Summary: Due to the compatibility issue, we hard coded the passes to do the pattern optimization. Here, we revisit the method since it has been a while since the changes went into production packages. We instead read from the config to decide whether we do the specific pattern optimization, which makes follow-up pattern additions easier.
Differential Revision: D56659934
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125136
Approved by: https://github.com/jackiexu1992
Fixes two build problems on ROCM 6.1 + Ubuntu 22.04
### Inconsistency value of CMAKE_PREFIX_PATH between `.ci/pytorch/build.sh` and Build Instructions
Currently `CMAKE_PREFIX_PATH` points to the conda base environment (commonly `/opt/conda`). However, the conda environment used in CI should be `/opt/conda/envs/py_<VERSION>`, which is supplied by `$CONDA_PREFIX`.
This divergence may cause libstdc++ version conflicts because the base conda environment may ship a different libstdc++ than the `py_<VERSION>` environment and/or the system default. One notable issue: on our internal CI system, this script failed to build the AOTriton library on Ubuntu 22.04 due to libstdc++ version conflicts between the HIP compiler and the conda base environment.
This PR fixes this and makes sure the CI script follows the official build instructions.
### Incorrect `tinfo` was linked on Ubuntu 22.04 due to flaws in parsing of `os-release`
The code that parses /etc/os-release was incorrect, and the distribution name was parsed as `PRETTY_Ubuntu` instead of `Ubuntu`. Due to this flaw, `libtinfo` was not linked into the binary, so the cpp unit tests failed to build because of missing symbols from `libtinfo`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118216
Approved by: https://github.com/jeffdaily, https://github.com/jithunnair-amd, https://github.com/pruthvistony, https://github.com/malfet, https://github.com/atalman
This doesn't introduce any new behavior, but sets up a basic cache key generation mechanism that I can test. From here I will:
- Add checks on the ops in an input FXGraph to make sure they are safe to cache. We'll be conservative in the first version here.
- Add serialization for FX graphs
- Save these FX graphs to disk in the cache
- Support graphs with more complicated ops like higher order ops and specialized nn modules
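For illustration, a minimal sketch (not the actual implementation) of the cache-key idea described above: hash together a stable representation of the FX graph, the metadata of the example inputs, and the relevant config. `fx_graph_cache_key` is a hypothetical name.
```python
import hashlib
import pickle

def fx_graph_cache_key(gm, example_inputs, config_dict):
    hasher = hashlib.sha256()
    hasher.update(gm.code.encode())  # printed code of the GraphModule
    for t in example_inputs:
        hasher.update(str((t.shape, t.stride(), t.dtype, t.device)).encode())
    hasher.update(pickle.dumps(sorted(config_dict.items())))
    return hasher.hexdigest()
```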
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124642
Approved by: https://github.com/aorenste
Summary: This commit fixes the pattern matching for conv-bn
during QAT fusion where both weight and bias are quantized per
channel. Previously this failed because weights and biases used
the same example kwargs for their scales and zero points,
causing these qparams to be tied during pattern matching.
Test Plan:
python test/test_quantization.py TestQuantizePT2EQAT_ConvBn1d.test_qat_conv_bn_per_channel_weight_bias
python test/test_quantization.py TestQuantizePT2EQAT_ConvBn2d.test_qat_conv_bn_per_channel_weight_bias
Reviewers: jerryzh168, angelayi
Subscribers: jerryzh168, angelayi, supriyar
Differential Revision: [D56740694](https://our.internmc.facebook.com/intern/diff/D56740694)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125208
Approved by: https://github.com/angelayi
Summary:
Add exclusion list to minimizer:
1. Some operations cannot be lowered when constructing subgraphs; this usually happens when they are isolated from their operation group.
2. Exclude them from the search strategies used for automation.
Reviewed By: jimone1
Differential Revision: D56327289
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124504
Approved by: https://github.com/jfix71
This PR fakifies ScriptObject inputs and attributes in export non-strict mode by default.
The basic idea is to `only fakify the script object during tracing (i.e. aot_export)`. After we get the traced graph module, eagerly executing, serializing, or running more passes will use the real script objects. This essentially treats the script object as a constant tensor.
Concretely, we
1. fakify all the script object inputs, and module attributes (gathered by constant_attrs).
2. patch the module's attributes with fakified script object
3. right after aot_export, remove the patching (to avoid changing the original module) then modify the exported graph module's attribute to real script object.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124239
Approved by: https://github.com/zou3519
Fixes #124528
Going over the options for our MapAllocator and what they do, I don't think any of the others need to be piped up to `torch.load`
4f29103749/aten/src/ATen/MapAllocator.h (L8-L16)
~However, I wonder if this `MmapVisibility(Enum)` is a good way to represent "or-ing" together of `mmap` flags if we want to extend it in the future. I looked over the flags for [`mmap(2)`](https://man7.org/linux/man-pages/man2/mmap.2.html), and could not immediately see how most of them would be useful for `torch.load` (would maybe `MAP_LOCKED` (like `mlock`) or `MAP_HUGE` ever be worthwhile?)~
Using the flags provided by the python `mmap` library so that we can extend the allowed flags and pipe them down to the cpp `mmap` call if there is a need for other flags in the future
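A hedged sketch of what this enables: the flags come from Python's own `mmap` module and can be OR-ed together like any other mmap flags. The setter name below, `torch.serialization.set_default_mmap_options`, is how I understand the entry point to look; treat it as an assumption rather than a confirmed API.
```python
import mmap
import torch

torch.save(torch.randn(4), "tensor.pt")

# Assumed entry point: pipe a Python mmap flag down to the C++ mmap call.
torch.serialization.set_default_mmap_options(mmap.MAP_PRIVATE)  # could be MAP_SHARED, etc.
t = torch.load("tensor.pt", mmap=True)
```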
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124889
Approved by: https://github.com/albanD
This PR fixes an issue presented when calling `aten.alias(int)` raises a TypeError.
```python
import torch
import torch.autograd.forward_ad as fwAD
def f(x):
    return 4312491 * x

device = "cpu"

with torch._subclasses.fake_tensor.FakeTensorMode():
    with fwAD.dual_level():
        x = torch.randn(3, device=device)
        y = torch.ones_like(x)
        dual = fwAD.make_dual(x, y)
        f(dual)
```
The test case above illustrates this bug.
1) `4312491` turns into a tensor that is a wrapped number
2) Forward mode AD calls `aten::alias` internally
3) The wrapped number (`4312491`) becomes a python integer
4) `aten.alias(int)` raises a `TypeError`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124774
Approved by: https://github.com/albanD, https://github.com/zou3519
This PR introduces a new way of building `dynamic_shapes` for export. The idea is to build up a mapping from input tensors to the dynamic shapes that should be assigned to their corresponding fake tensors.
This mapping is automatically converted to the current form of `dynamic_shapes`, which must exactly match the structure of inputs. We do this by using pytree utils.
With the current `dynamic_shapes`, we had to be careful about user-defined classes that are registered with pytree, since such classes are not necessarily polymorphic containers; they may be fine containing tensors, but not dynamic shapes. Thus we had decided to allow input instances of such classes to be associated with dynamic shapes in flattened form. This decision needs to be mirrored in this PR as well. To make it easier to keep these code paths in sync, we refactor the current recursive procedure for associating inputs with dynamic shapes to use the same pytree utils. This needs minor fixes to a few tests where `dynamic_shapes` were not exactly matching the structure of inputs.
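For reference, a small illustration (not from this PR) of the current form, where the spec must mirror the input structure exactly; the mapping-based API described above is meant to remove the need to hand-build this nesting.
```python
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x, d):
        return x + d["y"]

batch = Dim("batch")
x, y = torch.randn(4, 8), torch.randn(4, 8)
# The spec is nested exactly like the inputs: one entry for `x`, a dict entry for `d`.
ep = export(M(), (x, {"y": y}), dynamic_shapes=({0: batch}, {"y": {0: batch}}))
print(ep)
```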
Differential Revision: D56551992
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124898
Approved by: https://github.com/zhxchen17
Part of a multi-PR work to improve #59168
Meant to complete:
- Write native kernels for AvgPool3d
- Write native kernels for MaxPool3d
- Write native kernels for AdaptiveAvgPool3d
- Write native kernels for AdaptiveMaxPool3d
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116305
Approved by: https://github.com/ezyang
I was seeing that, for a reduction kernel and a given block size on AMDGPU, the 16-byte vectorization bandwidth per thread was not fully leveraged, while this was not a problem on NVGPU. It appeared that each thread got less data to process because a whole row was processed by more threads, and the number of elements each thread got was not enough to saturate full vectorization. On AMDGPU, a warp has 64 lanes compared to 32 on the NV side, so I'm tuning down the default number of warps (8 for NV) for AMD. I'm seeing a 10% speedup on an internal benchmark.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125084
Approved by: https://github.com/shunting314
For microbatching use cases (e.g. PP), we may use fp32 reduce-scatter (i.e. `MixedPrecisionPolicy(reduce_dtype=torch.float32)`), where we want to accumulate the unsharded gradients in fp32 across microbatches until reduce-scattering in fp32 upon the last microbatch.
Note that the `unsharded_param` is in bf16, so we must save the fp32 accumulated gradient to an attribute different from `.grad`. Moreover, saving a new attribute on the `torch.Tensor` leads to some annoying type checking issues (where the attribute may not be defined), so this PR prefers to save the attribute on the `FSDPParam` class instead.
One could argue that this behavior should be configurable, but since I think for large-scale training, everyone is leaning toward fp32 accumulation across microbatches, let us avoid adding another argument for now.
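A hedged sketch of the configuration this targets, assuming the default process group and device mesh are already initialized (e.g. under torchrun); this is illustrative, not taken from the PR:
```python
import torch
from torch.distributed._composable.fsdp import fully_shard, MixedPrecisionPolicy

# bf16 unsharded params/grads with fp32 reduce-scatter: gradients accumulated
# across microbatches stay in fp32 until the reduce-scatter on the last one.
model = torch.nn.Linear(16, 16)
fully_shard(
    model,
    mp_policy=MixedPrecisionPolicy(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.float32,
    ),
)
```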
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125191
Approved by: https://github.com/weifengpy
ghstack dependencies: #125190
The unit test for fp32 `param_dtype` and bf16 `reduce_dtype` was disabled. This PR debugs the issue and identifies the root cause as numeric differences between NCCL bf16 all-reduce vs. bf16 reduce-scatter. We address this by having the baseline use reduce-scatter -> all-gather to implement all-reduce.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125190
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
By using `Py_NewRef`
Also, wrap `THPDtype_to_real`/`THPDtype_to_complex` calls with `HANDLE_TH_ERRORS`
Add regression tests for the above issues: call to_complex on integral dtypes, which raises an exception, and check that the reference count is preserved across repeated to_complex/to_real calls to detect whether a leak is happening.
Replace
```cpp
auto dtype = (PyObject*)torch::getTHPDtype(current_dtype);
Py_INCREF(dtype);
return dtype;
```
with a more compact/streamlined equivalent
```cpp
return Py_NewRef(torch::getTHPDtype(current_dtype));
```
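As a side note, a hedged sketch of the refcount-style regression test described above (not the exact test added by this PR):
```python
import sys
import torch

# Repeated to_complex() calls should not change the refcount of the returned
# dtype object if no leak occurs.
before = sys.getrefcount(torch.complex64)
for _ in range(1000):
    torch.float32.to_complex()
assert sys.getrefcount(torch.complex64) == before, "to_complex() leaked a reference"

# to_complex() on an integral dtype should raise rather than crash
# (exception type not asserted here).
try:
    torch.int32.to_complex()
except Exception:
    pass
```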
Fixes https://github.com/pytorch/pytorch/issues/124868
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125154
Approved by: https://github.com/Skylion007, https://github.com/albanD
yolo
Also
* Ensure that at least 1 test always gets run (`//` does truncation which results in 0 if you have too few tests discovered)
* Don't run test removal on slow tests - I'm not touching that yet
I am avoiding everything other than pull + trunk workflows, so this is not done on Windows CUDA, which runs on periodic
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125049
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
When we do cudagraph warmup, we record which outputs are in the cudagraph pool, so subsequently when we invoke a cudagraph and need to reclaim its memory we can free the prior run's outputs and make them error on access.
In warmup, we detect this by ignoring outputs which are an alias of an input that is not a prior output. We did this by checking data pointer. In very rare situations, a data pointer of a non cudagraph input might get reallocated to a cudagraph pool and causes us to ignore it.
This was happening with gpt-fast error with gemma 2 when coordinate_descent_tuning was set to False.
This PR updates the check so that we detect aliasing with non-cudagraph inputs by looking at the storage pointer.
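For illustration (unrelated to cudagraphs specifically), a view can report a different data_ptr() than the tensor it aliases while still sharing the same underlying storage, which is why comparing storages is the more reliable aliasing signal:
```python
import torch

x = torch.arange(8)
v = x[2:]                       # a view that starts at a different address
assert v.data_ptr() != x.data_ptr()
assert v.untyped_storage().data_ptr() == x.untyped_storage().data_ptr()
```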
Unrelated: saw very weird behavior where an output had the same data pointer as a supposedly live input but not the same cdata 🤔 I would think that is not possible.
```
out[0]._cdata in [ref()._cdata for ref in non_cudagraph_inps_storage_refs]  # False
out[0].data_ptr() in [ref().data_ptr() for ref in non_cudagraph_inps_storage_refs]  # True
```
Differential Revision: [D56607721](https://our.internmc.facebook.com/intern/diff/D56607721)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124981
Approved by: https://github.com/ezyang
This PR uses str for reduce_op directly instead of the c10d enum. Since
our functional collective already uses str, there's no reason that we
need the c10d enum anymore, as that requires a conversion.
Also, the str hash + eq performance is significantly faster than that of
the c10d type, so this somewhat reduces the CPU overhead too.
Some local cpu benchmarks on `1000000` hash operations:
```
Hash performance for string type: 0.039897 seconds
Hash performance for integer type: 0.304665 seconds
```
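A hedged sketch of the kind of micro-benchmark behind these numbers (the exact harness isn't shown in the PR; `bench_hash` is a stand-in name, and results will vary by machine):
```python
import time

def bench_hash(value, n=1_000_000):
    start = time.perf_counter()
    for _ in range(n):
        hash(value)
    return time.perf_counter() - start

# Compare a str reduce op against whatever object was previously used as the key.
print("Hash performance for string type: %.6f seconds" % bench_hash("avg"))
```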
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125172
Approved by: https://github.com/awgu, https://github.com/XilunWu, https://github.com/tianyu-l
# Summary
This is part one of adding backwards support to FlexAttention.
This PR focuses on the eager implementation and wiring up enough of the templated_attention_backward(name change soon 😉) to get through aot_eager.
Notably this does not actually wire up the triton template just yet in order to make this PR easier to review. That will be the next follow up PR.
#### Structure
We pass both the forward and backward graphs to the backwards HOP since both need to be inlined into the calculation for backwards:
- the forward graph is needed in order to re-compute the scores
- the joint graph is needed in order to construct the correct gradients post softmax_grad calc
### Attached AOT Graph
https://gist.github.com/drisspg/ce4c041f8df8a5a7983c5174705cf2b5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123902
Approved by: https://github.com/Chillee
Summary:
Fixes https://github.com/pytorch/pytorch/issues/122842
Currently, calling ep.module() on an ExportedProgram leads to a GraphModule with a default forward signature (e.g. arg_0, arg_1, ...). This leads to original placeholder names disappearing for retracing/re-exporting.
Fixing this issue by creating a forward_arg_names field (will take renaming suggestions for this), that stores the positional & keyword arg names that are used. These names aren't present in the call_spec currently stored, and requires a major version bump for the ExportedProgram schema.
Test Plan: Tests exist for export, but names are now changed from generic (e.g. arg_0, arg_1) to follow user inputs (e.g. x, y)
Differential Revision: D56484994
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124765
Approved by: https://github.com/zhxchen17
If there's an exception during collection it can result in the profiler never being stopped properly. As a result all subsequent tests that use profiling will also fail - even if they pass in isolation.
I'm hoping this fixes the flakiness in #124253, #124220, #82720, #119346, #119364, #119490, #119526, #119537 (and the currently closed #82864).
Before:
```
(py312) $ PYTORCH_TEST_WITH_DYNAMO=1 pytest test/profiler/test_profiler.py
===================================================================================================================== FAILURES =====================================================================================================================
============================================================================================================= short test summary info ==============================================================================================================
FAILED test/profiler/test_profiler.py::TestExecutionTrace::test_execution_trace_with_kineto - AssertionError: Element counts were not equal:
FAILED test/profiler/test_profiler.py::TestExperimentalUtils::test_profiler_conv2d_bias_followed_by_batchnorm2d_pattern - RuntimeError: Can't disable Kineto profiler when it's not running
FAILED test/profiler/test_profiler.py::TestExperimentalUtils::test_profiler_extra_cuda_copy_pattern - RuntimeError: Can't disable Kineto profiler when it's not running
FAILED test/profiler/test_profiler.py::TestExperimentalUtils::test_profiler_extra_cuda_copy_pattern_benchmark - AttributeError: 'NoneType' object has no attribute 'profiler'
FAILED test/profiler/test_profiler.py::TestExperimentalUtils::test_profiler_fp32_matmul_pattern - AttributeError: 'NoneType' object has no attribute 'profiler'
FAILED test/profiler/test_profiler.py::TestExperimentalUtils::test_profiler_matmul_dim_fp16_pattern - RuntimeError: Can't disable Kineto profiler when it's not running
FAILED test/profiler/test_profiler.py::TestProfiler::test_kineto_multigpu - torch._dynamo.exc.InternalTorchDynamoError: 'NoneType' object has no attribute 'events'
FAILED test/profiler/test_profiler.py::TestProfiler::test_oom_tracing - AssertionError: RuntimeError not raised
FAILED test/profiler/test_profiler.py::TestProfiler::test_source_multithreaded_basic_work_in_main_thread_False - RuntimeError: Can't disable Kineto profiler when it's not running
FAILED test/profiler/test_profiler.py::TestProfiler::test_source_multithreaded_close_in_scope_work_in_main_thread_False - RuntimeError: Can't disable Kineto profiler when it's not running
FAILED test/profiler/test_profiler.py::TestProfiler::test_source_multithreaded_complex_work_in_main_thread_False - RuntimeError: Can't disable Kineto profiler when it's not running
FAILED test/profiler/test_profiler.py::TestProfiler::test_source_multithreaded_multiple_preexisting_work_in_main_thread_False - RuntimeError: Can't disable Kineto profiler when it's not running
FAILED test/profiler/test_profiler.py::TestProfiler::test_source_multithreaded_open_in_scope_work_in_main_thread_False - RuntimeError: Can't disable Kineto profiler when it's not running
FAILED test/profiler/test_profiler.py::TestTorchTidyProfiler::test_optimizer_parameters_sgd - RuntimeError: Can't disable Kineto profiler when it's not running
FAILED test/profiler/test_profiler.py::TestTorchTidyProfiler::test_refcounts - RuntimeError: Can't disable Kineto profiler when it's not running
FAILED test/profiler/test_profiler.py::TestTorchTidyProfiler::test_sparse_tensors - RuntimeError: Can't disable Kineto profiler when it's not running
==================================================================================================== 16 failed, 26 passed, 53 skipped in 25.51s ====================================================================================================
```
After:
```
(py312) $ PYTORCH_TEST_WITH_DYNAMO=1 pytest test/profiler/test_profiler.py
===================================================================================================================== FAILURES =====================================================================================================================
============================================================================================================= short test summary info ==============================================================================================================
FAILED test/profiler/test_profiler.py::TestExecutionTrace::test_execution_trace_with_kineto - AssertionError: Element counts were not equal:
FAILED test/profiler/test_profiler.py::TestExperimentalUtils::test_profiler_extra_cuda_copy_pattern - RuntimeError: !stack.empty() INTERNAL ASSERT FAILED at "/data/users/aorenste/pytorch/torch/csrc/autograd/profiler_python.cpp":969...
FAILED test/profiler/test_profiler.py::TestExperimentalUtils::test_profiler_extra_cuda_copy_pattern_benchmark - AttributeError: 'NoneType' object has no attribute 'profiler'
FAILED test/profiler/test_profiler.py::TestExperimentalUtils::test_profiler_fp32_matmul_pattern - AttributeError: 'NoneType' object has no attribute 'profiler'
FAILED test/profiler/test_profiler.py::TestProfiler::test_kineto_multigpu - torch._dynamo.exc.InternalTorchDynamoError: 'NoneType' object has no attribute 'events'
FAILED test/profiler/test_profiler.py::TestProfiler::test_oom_tracing - AssertionError: RuntimeError not raised
FAILED test/profiler/test_profiler.py::TestTorchTidyProfiler::test_optimizer_parameters_sgd - RuntimeError: !stack.empty() INTERNAL ASSERT FAILED at "/data/users/aorenste/pytorch/torch/csrc/autograd/profiler_python.cpp":969, please...
==================================================================================================== 7 failed, 35 passed, 53 skipped in 31.51s =====================================================================================================
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125131
Approved by: https://github.com/Skylion007, https://github.com/aaronenyeshi
Fixes #100152
1. Fix the incorrect tests for lazy init of the PrivateUse1 backend named foo
2. Fix the broken backend meta registry mechanism when compiling with clang++ (compiling with g++ works fine); this was introduced by a static variable in an inline function
3. Refactor the tests and make them more flexible
4. Disable the two tests temporarily
- test_open_device_storage_pin_memory
- test_compile_autograd_function_aliasing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124712
Approved by: https://github.com/albanD, https://github.com/malfet
Summary:
fixes two issues:
- when exporting with debug=True, the list of error-causing modules and a dependency path to them is not printed correctly: a missing newline after the path puts the module name for the next error on the wrong line, which makes the output a confusing mess to read
- when a pickled object directly references more than one mocked module, the error message incorrectly repeats the same information, claiming the referenced attribute is present in several different libraries. This happens because the if condition references the last module name seen while walking the pickle ops rather than the module name from the enclosing `for module_name in all_dependencies:` block. As a result, one error prints as O(all_dependencies) errors, all with different module names but the same attribute name
Differential Revision: D56578035
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124943
Approved by: https://github.com/JonAmazon, https://github.com/houseroad
This PR renames the `FSDP` class to `FSDPModule`. This is a BC breaking change. The rationale is that `FSDPModule` is more descriptive since `fully_shard` is a module-level API (applied to a `module` arg), so the `FSDP` class will always correspond to a module.
Also, users commonly import `FullyShardedDataParallel` as `FSDP`, so this can help avoid some name conflict in some cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124955
Approved by: https://github.com/wanchaol, https://github.com/wconstab
ghstack dependencies: #124651, #124741, #124767, #124768, #124780, #124787
This completely subsumes https://github.com/pytorch/pytorch/pull/120816
This makes use of the unbacked binding machinery to teach Inductor how to generate deferred runtime asserts directly. There is some back story about why I did it this way, let me explain.
Previously, our strategy for generating runtime asserts was that Dynamo would insert them into the FX graph after finishing tracing, and we would attempt to code generate them based on the FX graph. This is a good strategy for export, where we immediately export the graph. However, this strategy was afflicted by problems in eager, where we reuse the same ShapeEnv as before. In particular, on subsequent graph passes, we would immediately turn all of these assertions into noops, because when we evaluated their expressions, we would see that because we had a deferred runtime assert in the ShapeEnv, we know "oh, of course this expression is True" already. Oops!
So, with this PR, we take the attitude that as long as the ShapeEnv sticks around, the ShapeEnv's list of deferred runtime asserts is the source of truth, and we don't put anything in the graph. So we just need to decide when to actually generate asserts, and the place I picked was Inductor lowering, since we already have an AssertScalar buffer concept, and so I just need to insert them at this point. AssertScalar also uses raw sympy.Expr rather than SymInt/Bool, so it is easier to prevent unrestricted simplification at this point.
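A hedged illustration (not from this PR) of where a deferred runtime assert comes from: `u` below is an unbacked SymInt, so the check on it cannot be decided at compile time and is recorded in the ShapeEnv to be asserted at runtime instead.
```python
import torch
torch._dynamo.config.capture_scalar_outputs = True

@torch.compile(fullgraph=True)
def f(x):
    u = x.sum().to(torch.int64).item()  # data-dependent scalar -> unbacked SymInt
    torch._check(u >= 0)                # recorded as a deferred runtime assert
    return torch.zeros(u)

print(f(torch.ones(5)))
```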
There are a few things jumbled together in this PR. I can split them if you want, but some of the changes are before I changed my strategy, but they're useful changes anyway.
**torch/_dynamo/output_graph.py** and **torch/_inductor/lowering.py** - Here, we stop putting deferred runtime asserts in the graph. I also have to make sure we don't DCE unused symbol arguments; we're going to get some goofy graph arguments this way, will be good to restore that optimization eventually. We also just disable codegen for `_assert_scalar` entirely; we assume that ShapeEnv will be good enough to capture all of these.
**torch/_inductor/codegen/wrapper.py** and **torch/_inductor/ir.py** - Add a way to codegen sizevars without forcing simplification
**torch/_inductor/graph.py** - The main logic. Our strategy is to interpose in the same place we are testing that unbacked SymInts are properly showing up in lowered code. The logic is directly analogous to the logic in the existing insert deferred runtime asserts FX pass, but it's simpler because sympy expressions can be directly stored on inductor IR nodes.
**torch/fx/experimental/symbolic_shapes.py** - For extra safety, we have a way of freezing runtime asserts, so that if you try to add more we error. This prevents us from adding runtime asserts after we've done lowering. There's a funny interaction with backwards which there's a comment for in graph.py
**torch/fx/passes/runtime_assert.py** - This is not really needed in this PR, but I rewrote the runtime assert logic to use unbacked_bindings rather than inferring it by looking for unbacked SymInts. Now, keypaths are translated into FX node accessors. Unfortunately, I couldn't delete the old inference code, because you still need it to find backed SymInts from arguments (as this pass may be used on graphs which don't explicitly bind all their shape variables as arguments). There are some new tests exercising this.
TODO: I think we need to generate asserts for replacements too. This is a preexisting problem that the old FX pass had too.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124874
Approved by: https://github.com/jansel
ghstack dependencies: #124864
I want to generate runtime assert nodes during lowering, which means
that I need a finalized list of asserts by the time I start lowering.
This means this runtime assert introduced in
https://github.com/pytorch/pytorch/pull/113839 must go. Fortunately,
this runtime assert was never exercisable, apparently, and the test
still "passes" without it. I replace it with a compile time test. We
can revisit if this assert fails in practice.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124864
Approved by: https://github.com/jansel
Summary: When I was debugging an issue, this silent error made debugging harder. It is better to error out earlier with a more descriptive error message.
Test Plan: None
Differential Revision: D56312433
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124411
Approved by: https://github.com/zhxchen17
Earlier, globals of inlined functions from other files were not handled correctly: we were not tracking mutations on them, and they were colliding with same-named globals in the parent function, etc. This PR overrides LOAD/STORE_GLOBAL for the inline tx and tracks mutations on them separately.
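A hedged sketch of the pattern this fixes: a function inlined from another module mutates that module's own global, which must not collide with a same-named global in the caller's module. The helper module is constructed programmatically here only to keep the example self-contained; `helpers`, `bump`, and `counter` are made-up names.
```python
import sys
import types
import torch

helpers = types.ModuleType("helpers")
exec(
    "counter = 0\n"
    "def bump():\n"
    "    global counter\n"
    "    counter += 1\n"
    "    return counter\n",
    helpers.__dict__,
)
sys.modules["helpers"] = helpers

counter = 100  # same name, different module; must stay untouched

@torch.compile(backend="eager")
def fn(x):
    return x + helpers.bump() + counter  # reads this module's `counter`

out = fn(torch.ones(3))
assert helpers.counter == 1 and counter == 100
assert out[0].item() == 102
```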
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125002
Approved by: https://github.com/jansel
ghstack dependencies: #125097, #125107
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/124286
The TorchBenchmark includes a method called `run_n_iterations` which runs model multiple times.
43f4e71daa/benchmarks/dynamo/common.py (L2272-L2276)
https://github.com/pytorch/pytorch/pull/123399 enables tracing into a `UserDefinedObjectVariable` that's an instance method, so the model is traced into an FX graph multiple times within `run_n_iterations`. Then, in Inductor, module-level `Conv-BN folding` fuses the same Conv-BN module multiple times in this case, which leads to accuracy failures. This PR addresses the issue by ensuring that each Conv-BN module is fused only once.
**TestPlan**
```
python -u -m pytest -s -v test/inductor/test_inductor_freezing.py -k test_folded_conv_bn_with_module_sharing
python -u -m pytest -s -v test/inductor/test_inductor_freezing.py -k test_folded_conv_functional_bn_with_module_sharing
python -u -m pytest -s -v test/inductor/test_inductor_freezing.py -k test_conv_bn_with_multi_bn_share_conv
python -u -m pytest -s -v test/inductor/test_inductor_freezing.py -k test_conv_functional_bn_with_multi_bn_share_conv
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124808
Approved by: https://github.com/jansel, https://github.com/jgong5
Summary:
LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.
This diff either (a) removes an unused variable and, possibly, its associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, e.g., in an `assert` statement that isn't present in production code.
- If you approve of this diff, please use the "Accept & Ship" button :-)
Test Plan: Sandcastle
Reviewed By: palmje
Differential Revision: D56587751
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125041
Approved by: https://github.com/Skylion007
Summary:
The LLVM warning `-Wmissing-field-initializers` has found one or more structs in this diff's files which were missing field initializers.
This can be unintended such as:
```
my_struct s1 = {0}; // Initializes *only* the first field to zero; others to default values
my_struct s2 = {}; // Initializes *all* fields to default values (often zero)
```
or it may be because only some of the members of a struct are initialized, perhaps because the items were added to the struct but not every instance of it was updated.
To fix the problem, I've either used `{}` to initialize all fields to default or added appropriate default initializations to the missing fields.
Test Plan: Sandcastle
Reviewed By: palmje
Differential Revision: D56614179
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125047
Approved by: https://github.com/Skylion007
Summary:
This is a forward Hotfix for T186742340.
Some recent changes in Pytorch / Inductor ( D56458606) led to aten.addmm operators being inserted twice into the list of choices to select from during autotuning. This appears to have triggered a test failure in fbcode.
This fix prevents the aten operators being added twice to the list of choices for autotuning.
Test Plan:
* Pytorch CI
* CUDA_LAUNCH_BLOCKING=1 buck2 test 'fbcode//mode/opt' fbcode//accelerators/pytorch/lib/pt2_utils/tests:compile_pt2_test -- --exact 'accelerators/pytorch/lib/pt2_utils/tests:compile_pt2_test - test_compile_pt2 (accelerators.pytorch.lib.pt2_utils.tests.compile_pt2_test.TestCompilePT2)'
Differential Revision: D56642879
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125065
Approved by: https://github.com/eellison
This PR removes the legacy impls of c10d_functional ops which are now irrelevant. For backward compatibility purpose, c10d_functional ops now call into _c10d_functional ops.
We also changed c10d_functional ops to be CompositeExplicitAutograd, so that when traced, only _c10d_functional ops appear in the graph. After this, we'll be able to remove the Inductor IR for the legacy functional collectives.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124979
Approved by: https://github.com/wanchaol
This PR continues to clean clang-tidy warnings in torch/csrc/distributed/c10d, following #124701. In addition, libfmt dependency is added in CMake code to enable using it in the headers. The libfmt has to be added as private dependency to torch_cuda and torch_hip because they include torch/csrc/distributed/c10d/Utils.hpp which uses libfmt.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124987
Approved by: https://github.com/malfet
Summary:
This makes barrier and rank operations linear instead of quadratic with the number of workers. This drastically improves performance for rendezvous when running with over 1000 hosts.
This uses 2 approaches for different areas:
* local rank assignment: each worker does 1 set and 1 get; local ranks are assigned on the rank 0 host in an O(n) operation, which keeps the total number of store operations linear in the number of workers.
* exit_barrier: use a counter and a final flag so each worker has to do max 1 set, 1 get and 1 add.
At 4000 hosts we see torchelastic be able to run in as little as 10 seconds down from 373 seconds.
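A hedged sketch of the counter-plus-flag exit barrier described above (not the torchelastic implementation; `exit_barrier` and the key names are made up): each worker does at most one add, one set, and one wait, so total store traffic stays linear in the number of workers.
```python
import datetime
import torch.distributed as dist

def exit_barrier(store: dist.Store, world_size: int, key: str = "exit_barrier"):
    arrived = store.add(f"{key}/count", 1)   # atomic increment, returns the new value
    if arrived == world_size:
        store.set(f"{key}/done", "1")        # the last worker releases everyone
    store.wait([f"{key}/done"], datetime.timedelta(seconds=300))
```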
Test Plan:
This is testing using many small tests running on a remote cluster.
{D56549942}
```
torchx run --scheduler mast -- --image=torchelastic_benchmark --j=4000x1
```
Differential Revision: D56605193
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124982
Approved by: https://github.com/kiukchung, https://github.com/kurman
- sets it as a fake stack trace as we don't have a generic comment feature
- when verbose is disabled, still adds a contextmanager and flag checks. the alternative is to use MACROS, but that wouldn't be usable with TORCH_LOGS
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124954
Approved by: https://github.com/jansel
## Description
Framework overhead is found to be big for the onednn qconv op (used for quantization with PT2E X86Inductor backend). This PR reduces the integration overhead by modifying the implementation of qconv.
## performance results
Running quantized Resnet50 on an Intel(R) Xeon(R) Platinum 8490H machine
Before
```
Average latency: 8.378 ms.
------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
------------------------- ------------ ------------ ------------ ------------ ------------ ------------
onednn::qconv2d_pointwise 86.54% 6.954ms 87.42% 7.025ms 132.547us 53
```
After
```
Average latency: 6.255 ms.
------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
------------------------- ------------ ------------ ------------ ------------ ------------ ------------
onednn::qconv2d_pointwise 85.05% 6.381ms 85.98% 6.451ms 121.717us 53
```
Test script:
```python
import torch
import torchvision
import time
import copy
import numpy as np
from torch._export import capture_pre_autograd_graph
from torch.ao.quantization.quantize_pt2e import (
prepare_pt2e,
convert_pt2e,
)
import torch.ao.quantization.quantizer.x86_inductor_quantizer as xiq
from torch.ao.quantization.quantizer.x86_inductor_quantizer import X86InductorQuantizer
torch._inductor.config.cpp.enable_kernel_profile=True
torch._inductor.config.profiler_mark_wrapper_call = True
torch._inductor.config.freezing = True
torch._inductor.config.cpp_wrapper = True
def bench_model(model, inputs):
    times = []
    with torch.no_grad():
        for _ in range(5):  # warm-up
            output = model(inputs)
        for _ in range(20):
            start_time = time.time()
            output = model(inputs)
            end_time = time.time()
            times.append(end_time - start_time)
    print('Average latency: %0.3f ms.' % (np.median(times) * 1000.0))
    with torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CPU]) as p:
        out_ipex = model(inputs)
    print(p.key_averages().table(sort_by="self_cpu_time_total", row_limit=-1))

def pt2e_ptq(m, example_inputs):
    m = m.eval()
    exported_model = capture_pre_autograd_graph(m, example_inputs)
    quantizer = X86InductorQuantizer()
    quantizer.set_global(xiq.get_default_x86_inductor_quantization_config())
    prepared_model = prepare_pt2e(exported_model, quantizer)
    _ = prepared_model(*example_inputs)
    converted_model = convert_pt2e(prepared_model)
    torch.ao.quantization.move_exported_model_to_eval(converted_model)
    with torch.no_grad():
        optimized_model = torch.compile(converted_model)
        _ = optimized_model(*example_inputs)
        _ = optimized_model(*example_inputs)
        bench_model(optimized_model, *example_inputs)
    return optimized_model

if __name__ == "__main__":
    data = torch.randn(16, 3, 224, 224)
    model_fp = torchvision.models.resnet50(weights=torchvision.models.ResNet50_Weights.DEFAULT)
    pt2e_ptq(copy.deepcopy(model_fp), (data,))
```
Differential Revision: [D56288440](https://our.internmc.facebook.com/intern/diff/D56288440)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123240
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jerryzh168
Summary:
This diff fixes a bug in PyTorch where creating a tensor from a list of booleans threw an error.
This fix resolves that issue. All credit goes to swolchok for identifying the root cause of the issue and suggesting this fix.
Test Plan: Running our model end to end works as expected and no error occurs.
Differential Revision: D55990810
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124899
Approved by: https://github.com/zhxchen17
**Summary**
This PR is attempt to land an experimental feature designed in #103686 . `local_map` is designed to allow users to apply to `DTensor` objects a function that was written to apply to `torch.Tensor`.
As a function, `local_map` takes in 2 required arguments (`func` and `out_placements`) and 3 optional arguments (`device_mesh`, `in_placements`, `redistribute_inputs`). `func` is the function to be applied to each local shard of input `DTensor`. `out_placements` is the sharding specification of output `DTensor`.
`local_map` returns a new function that does the following:
1. Infer `device_mesh` and `in_placements` from the `DTensor` input if they're not provided. If `device_mesh` is provided, it must be identical to the device mesh of every `DTensor` input. If `in_placements` is provided, it serves as the required sharding specification of the corresponding `DTensor` input before feeding its local shard into `func`. If it differs from the `DTensor`'s sharding specification, an exception is raised when `redistribute_inputs=False`; otherwise the input is redistributed to the required sharding.
2. Call `func` with the arguments passed in along with `device_mesh` except `DTensor`s. For `DTensor`, pass in its local shard. This `func` may include collectives.
3. For each output of `func` that has a valid (i.e. not `None`) sharding specification in `out_placements`, construct a new `DTensor` from the output and the specification, and use this `DTensor` as the output. A usage sketch follows below.
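A hedged usage sketch: the import path for `local_map` is my assumption about where the experimental DTensor API lives, and the structure of the placement arguments follows my reading of the description above. Run under torchrun with 2 ranks.
```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._tensor import distribute_tensor, Shard, Replicate
from torch.distributed._tensor.experimental import local_map  # assumed import path

def local_mm(a, b):
    # written against plain torch.Tensor; operates on each local shard
    return torch.mm(a, b)

mesh = init_device_mesh("cpu", (2,))
A = distribute_tensor(torch.randn(8, 8), mesh, [Shard(0)])
B = distribute_tensor(torch.randn(8, 8), mesh, [Replicate()])

sharded_mm = local_map(
    local_mm,
    out_placements=[Shard(0)],
    in_placements=([Shard(0)], [Replicate()]),
    device_mesh=mesh,
    redistribute_inputs=True,
)
C = sharded_mm(A, B)  # a DTensor sharded on dim 0
```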
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123676
Approved by: https://github.com/wanchaol