pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-20 12:54:11 +08:00

Author	SHA1	Message	Date
albanD	e0cabebad9	Add missing include for internal build (#119721 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/119721 Approved by: https://github.com/huydhn	2024-02-12 22:36:16 +00:00
PyTorch MergeBot	24bdd03d23	Revert "Reify view_func() closures as ViewFuncs (#118404 )" This reverts commit d5a6762263a98e5153bc057c8ba4f377542c7e55. Reverted https://github.com/pytorch/pytorch/pull/118404 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/118404#issuecomment-1938600260))	2024-02-12 12:38:51 +00:00
Yu, Guangye	8fd11cb307	[2/2] Intel GPU Runtime Upstreaming for Stream (#117619 ) # Motivation According to [[1/2] Intel GPU Runtime Upstreaming for Stream](https://github.com/pytorch/pytorch/pull/117611), as mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the second PR covers the changes under `python frontend`. # Design Currently, it primarily offers stream-related APIs, including - `torch.xpu.StreamContext` - `torch.xpu.current_stream` - `torch.xpu.set_stream` - `torch.xpu.synchronize` - `torch._C._xpu_getCurrentRawStream` # Additional Context We will implement functions like `torch.xpu.Stream.wait_event`, `torch.xpu.Stream.wait_stream`, and `torch.xpu.Stream.record_event` in the next PR related to `Event`. The differences from CUDA: XPU has no default or external stream and lacks the APIs below: - `torch.cuda.ExternalStream` - `torch.cuda.default_stream` - `torch.cuda.is_current_stream_capturing` Pull Request resolved: https://github.com/pytorch/pytorch/pull/117619 Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/albanD ghstack dependencies: #117611	2024-02-10 03:39:42 +00:00
albanD	4b9568a360	Add Accelerator device and shell hooks (#119329 ) This adds a concept of Accelerator that points to one of our devices. See DeviceAccelerator.h in this PR for details https://github.com/pytorch/pytorch/pull/119329/files#diff-83cc748bed5df1a453c272cc5ecc7e572d4eb694c5125384d8fbd17a0b5f50c8 It also adds scaffolding for shared C++ API to allow generic feature implementation. This PR in particular updates the autograd engine to use this generic API. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119329 Approved by: https://github.com/ezyang	2024-02-09 18:54:28 +00:00
Joel Schlosser	d5a6762263	Reify view_func() closures as ViewFuncs (#118404 )

Replaces `view_func()` closures with a reified `ViewFunc` data structure. Codegen generates a `ViewFunc` subclass for each view op (e.g. `NarrowViewFunc`) containing state needed to reconstruct the view. The `ViewFunc` API allows for querying and hot-swapping any `SymInt`s or `Tensors` in the state through `get_symints()` / `get_tensors()` / `clone_and_set()`, which will be essential for fake-ification later on.

```cpp
/// Base class for view functions, providing reapplication of a view on a new base.
/// Each view op should get a codegenerated subclass of this class containing
/// any state needed to reconstruct the view. The class also provides convenience
/// accessors for saved SymInts / tensor state. This is useful for e.g. fake-ification,
/// where we want to use symbolic values or fake tensors instead.
struct TORCH_API ViewFunc {
  virtual ~ViewFunc() {}
  /// Returns any SymInts in the saved state.
  virtual std::vector<c10::SymInt> get_symints() const { return {}; }
  /// Returns the number of SymInts in the saved state.
  virtual size_t num_symints() const { return 0; }
  /// Returns any tensors in the saved state.
  virtual std::vector<at::Tensor> get_tensors() const { return {}; }
  /// Returns the number of tensors in the saved state.
  virtual size_t num_tensors() const { return 0; }
  /// Reapplies the view on the given base using the saved state.
  virtual at::Tensor operator()(const at::Tensor&) const = 0;
  /// Returns a clone of this ViewFunc, optionally with the specified saved state.
  virtual std::unique_ptr<ViewFunc> clone_and_set(
      std::optional<std::vector<c10::SymInt>> = c10::nullopt,
      std::optional<std::vector<at::Tensor>> = c10::nullopt) const = 0;

 protected:
  /// Sets the values of any SymInts in the saved state. The input vector size must
  /// match the number of SymInts in the saved state (i.e. the size of the list
  /// returned by get_symints()).
  virtual void set_symints(std::vector<c10::SymInt>) {}
  /// Sets the values of any Tensors in the saved state. The input vector size must
  /// match the number of Tensors in the saved state (i.e. the size of the list
  /// returned by get_tensors()).
  virtual void set_tensors(std::vector<at::Tensor>) {}
};
```

New codegen files:
* `torch/csrc/autograd/generated/ViewFuncs.h`
* `torch/csrc/autograd/generated/ViewFuncs.cpp`

The templates for these also contain impls for `ChainedViewFunc` and `ErroringViewFunc`, which are used in a few places within autograd.

Example codegen for `slice.Tensor`:

```cpp
// torch/csrc/autograd/generated/ViewFuncs.h
#define SLICE_TENSOR_VIEW_FUNC_AVAILABLE
struct SliceTensorViewFunc : public torch::autograd::ViewFunc {
  SliceTensorViewFunc(int64_t dim, c10::optional<c10::SymInt> start, c10::optional<c10::SymInt> end, c10::SymInt step) : dim(dim), start(start), end(end), step(step) {};
  virtual ~SliceTensorViewFunc() override {};
  virtual std::vector<c10::SymInt> get_symints() const override;
  virtual size_t num_symints() const override;
  virtual std::vector<at::Tensor> get_tensors() const override;
  virtual size_t num_tensors() const override;
  virtual at::Tensor operator()(const at::Tensor&) const override;
  virtual std::unique_ptr<ViewFunc> clone_and_set(
      std::optional<std::vector<c10::SymInt>> = c10::nullopt,
      std::optional<std::vector<at::Tensor>> = c10::nullopt) const override;

 protected:
  virtual void set_symints(std::vector<c10::SymInt>) override;
  virtual void set_tensors(std::vector<at::Tensor>) override;

 private:
  int64_t dim;
  c10::optional<c10::SymInt> start;
  c10::optional<c10::SymInt> end;
  c10::SymInt step;
};

...

// torch/csrc/autograd/generated/ViewFuncs.cpp
std::vector<c10::SymInt> SliceTensorViewFunc::get_symints() const {
  ::std::vector<c10::SymInt> symints;
  symints.reserve((start.has_value() ? 1 : 0) + (end.has_value() ? 1 : 0) + 1);
  if(start.has_value()) symints.insert(symints.end(), (start));
  if(end.has_value()) symints.insert(symints.end(), (end));
  symints.push_back(step);
  return symints;
}

size_t SliceTensorViewFunc::num_symints() const {
  return static_cast<size_t>((start.has_value() ? 1 : 0) + (end.has_value() ? 1 : 0) + 1);
}

void SliceTensorViewFunc::set_symints(std::vector<c10::SymInt> symints) {
  TORCH_INTERNAL_ASSERT(symints.size() == num_symints());
  auto i = 0;
  if(start.has_value()) start = symints[i];
  i += (start.has_value() ? 1 : 0);
  if(end.has_value()) end = symints[i];
  i += (end.has_value() ? 1 : 0);
  step = symints[i];
}

std::vector<at::Tensor> SliceTensorViewFunc::get_tensors() const {
  ::std::vector<at::Tensor> tensors;
  return tensors;
}

size_t SliceTensorViewFunc::num_tensors() const {
  return static_cast<size_t>(0);
}

void SliceTensorViewFunc::set_tensors(std::vector<at::Tensor> tensors) {
  TORCH_INTERNAL_ASSERT(tensors.size() == num_tensors());
}

at::Tensor SliceTensorViewFunc::operator()(const at::Tensor& input_base) const {
  return at::_ops::slice_Tensor::call(input_base, dim, start, end, step);
}

std::unique_ptr<ViewFunc> SliceTensorViewFunc::clone_and_set(
    std::optional<std::vector<c10::SymInt>> symints,
    std::optional<std::vector<at::Tensor>> tensors) const {
  auto output = std::make_unique<SliceTensorViewFunc>(dim, start, end, step);
  if (symints.has_value()) {
    output->set_symints(std::move((symints)));
  }
  if (tensors.has_value()) {
    output->set_tensors(std::move((tensors)));
  }
  return output;
}
```

The `_view_func()` / `_view_func_unsafe()` methods now accept two additional (optional) args for `symint_visitor_fn` / `tensor_visitor_fn`. If these are defined, they are expected to be python callables that operate on a single SymInt / tensor and return a new one. This allows for the hot-swapping needed during fake-ification.

For testing, there are extensive pre-existing tests, and I added a test to ensure that hot-swapping functions correctly.

```sh
python test/test_autograd.py -k test_view_func_replay
python test/test_ops.py -k test_view_replay
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118404 Approved by: https://github.com/ezyang	2024-02-09 18:51:36 +00:00
Yu, Guangye	5c46600f84	[RELAND] refactor lazy init to device-agnostic (#119248 ) # Motivation This PR intends to extend `cuda_lazy_init` to `device_lazy_init`, a device-agnostic API that can support any backend, and to change `maybe_initialize_cuda` to `maybe_initialize_device` to support lazy initialization for CUDA while maintaining scalability. # Design We maintain a flag for each backend to manage the lazy initialization state separately. # Additional Context No additional UTs are needed. This is a reland PR; the original PR is [refactor lazy init to device-agnostic](https://github.com/pytorch/pytorch/pull/118846). This is a common PR and does not trigger xpu ciflow. Differential Revision: [D53478332](https://our.internmc.facebook.com/intern/diff/D53478332) Pull Request resolved: https://github.com/pytorch/pytorch/pull/119248 Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/atalman	2024-02-07 15:58:51 +00:00
William Wen	ae4e866bba	[dynamo] refactor CacheEntry and ExtraState to eval_frame.c to C++ (#118438 ) Part of implementing CacheEntry invalidation to fix https://github.com/pytorch/pytorch/issues/112090. Changes: - Move CacheEntry and ExtraState to C++ - Use pybind to control reference counting - Use std::list instead of manually implementing a linked list Pull Request resolved: https://github.com/pytorch/pytorch/pull/118438 Approved by: https://github.com/jansel	2024-02-06 20:48:11 +00:00
PyTorch MergeBot	ab613a4019	Revert "refactor lazy init to device-agnostic (#118846 )" This reverts commit 520771d7b35034c96c5b4604ecf8960e6aab856f. Reverted https://github.com/pytorch/pytorch/pull/118846 on behalf of https://github.com/atalman due to Failing, tests https://github.com/pytorch/torchdistx/blob/main/src/python/torchdistx/_C/fake.cc#L11 ([comment](https://github.com/pytorch/pytorch/pull/118846#issuecomment-1927651305))	2024-02-05 18:06:30 +00:00
Yu, Guangye	520771d7b3	refactor lazy init to device-agnostic (#118846 ) # Motivation This PR intends to extend `cuda_lazy_init` to `device_lazy_init`, a device-agnostic API that can support any backend, and to change `maybe_initialize_cuda` to `maybe_initialize_device` to support lazy initialization for CUDA while maintaining scalability. # Design We maintain a flag for each backend to manage the lazy initialization state separately. # Additional Context No additional UTs are needed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/118846 Approved by: https://github.com/malfet	2024-02-02 12:10:39 +00:00
Yu, Guangye	a205e7bf56	[3/4] Intel GPU Runtime Upstreaming for Device (#116850 ) # Motivation According to [[1/4] Intel GPU Runtime Upstreaming for Device](https://github.com/pytorch/pytorch/pull/116019), as mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), this third PR covers the changes under `libtorch_python`. # Design This PR primarily offers device-related APIs in the python frontend, including - `torch.xpu.is_available` - `torch.xpu.device_count` - `torch.xpu.current_device` - `torch.xpu.set_device` - `torch.xpu.device` - `torch.xpu.device_of` - `torch.xpu.get_device_name` - `torch.xpu.get_device_capability` - `torch.xpu.get_device_properties` - ==================== - `torch.xpu._DeviceGuard` - `torch.xpu._is_compiled` - `torch.xpu._get_device` # Additional Context We will implement the support of lazy initialization in the next PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/116850 Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/malfet	2024-02-01 12:31:26 +00:00
CaoE	29516bd2a0	add _amp_foreach_non_finite_check_and_unscale_cpu_ and _amp_update_scale_cpu_ kernels on CPU (#109281 ) Step1 of https://github.com/pytorch/pytorch/issues/111559. Pull Request resolved: https://github.com/pytorch/pytorch/pull/109281 Approved by: https://github.com/jgong5, https://github.com/ezyang	2024-01-16 15:25:08 +00:00
Bin Bao	79ba39710e	[AOTI] Forward fix a Windows build failure (#116790 ) Summary: forward fix https://github.com/pytorch/pytorch/pull/116269 Pull Request resolved: https://github.com/pytorch/pytorch/pull/116790 Approved by: https://github.com/khabinov, https://github.com/huydhn	2024-01-05 06:00:58 +00:00
Bin Bao	70f3a530d7	[AOTI] Add pybind for AOTIModelContainerRunnerCpu and AOTIModelContainerRunnerCuda (#116269 ) Summary: Now we can allocate an AOTIModelContainerRunner object instead of relying on torch.utils.cpp_extension.load_inline. Also renamed AOTInductorModelRunner to AOTIRunnerUtil in this PR. Test Plan: CI Reviewed By: khabinov Differential Revision: D52339116 Pull Request resolved: https://github.com/pytorch/pytorch/pull/116269 Approved by: https://github.com/khabinov	2024-01-04 18:58:24 +00:00
Bin Bao	2dce364634	[AOTI][refactor] Remove model_container_runner_cuda.cpp (#116113 ) Differential Revision: [D52301272](https://our.internmc.facebook.com/intern/diff/D52301272) Pull Request resolved: https://github.com/pytorch/pytorch/pull/116113 Approved by: https://github.com/khabinov ghstack dependencies: #116047	2023-12-21 14:56:25 +00:00
Bin Bao	fabf9433e7	[AOTI][refactor] Organize model runner files (#116022 ) Summary: Move runner util files into a subdirectory and put AOTIModelContainerRunnerCpu into a separate file Differential Revision: [D52300693](https://our.internmc.facebook.com/intern/diff/D52300693) Pull Request resolved: https://github.com/pytorch/pytorch/pull/116022 Approved by: https://github.com/khabinov	2023-12-20 15:35:34 +00:00
Yifu Wang	6e1ba79b7f	[re-land] Introduce 3 low-latency, intra-node allreduce algorithms for small messages to PyTorch (#114001 ) (#116125 ) This is an attempt to re-land https://github.com/pytorch/pytorch/pull/114001. The previous attempt used `std::array` in cuda kernels which wasn't compatible with Meta's internal build. Pull Request resolved: https://github.com/pytorch/pytorch/pull/116125 Approved by: https://github.com/yf225	2023-12-20 07:13:50 +00:00
PyTorch MergeBot	91e184fd74	Revert "Introduce 3 low-latency, intra-node allreduce algorithms for small messages to PyTorch (#114001 )" This reverts commit 4edc921857f39ba9510b6ab1c454149cfb2de157. Reverted https://github.com/pytorch/pytorch/pull/114001 on behalf of https://github.com/jeanschmidt due to Breaking multiple internal tests, might be flakiness but multiple retries did not elicit an improvement, please check internal diff ([comment](https://github.com/pytorch/pytorch/pull/114001#issuecomment-1863036417))	2023-12-19 16:01:19 +00:00
Yifu Wang	4edc921857	Introduce 3 low-latency, intra-node allreduce algorithms for small messages to PyTorch (#114001 )

## Summary

This PR added 3 intra-node GPU allreduce algorithms to PyTorch:
- One-shot allreduce (inspired by FasterTransformer): all ranks simultaneously read and accumulate data from other ranks.
- Two-shot allreduce (inspired by FasterTransformer): all ranks simultaneously read and accumulate `1 / world_size` data from other ranks. Then all ranks read accumulated data from other ranks (effectively one-shot reduce-scatter + one-shot all-gather).
- Hybrid cube mesh allreduce (original): a one-shot allreduce variant that avoids transmission over PCIe on HCM topology.

## Micro Benchmarks

![image](https://github.com/pytorch/pytorch/assets/4156752/7bd25ffc-cd5b-4acb-bd65-b01bc136726e)
![image](https://github.com/pytorch/pytorch/assets/4156752/3ced31b4-6c31-4f34-a2d8-c072df29ae0e)
![image](https://github.com/pytorch/pytorch/assets/4156752/5b942c05-4fcc-4ec9-ae29-12c64080bb1c)

## Details

The intra-node algos are organized behind `c10d::IntraNodeComm`, which is responsible for:
- Managing handshaking and cuda IPC handle exchange among ranks.
- Querying NVLink connection and detecting topology.
- Performing algo selection based on available info.
- Launching the selected allreduce kernel.

`c10d::IntraNodeComm` is integrated into `c10d::ProcessGroupNCCL` as follows:
- When the `ENABLE_INTRA_NODE_COMM` environment variable is set, `c10d::ProcessGroupNCCL` initializes a `c10d::IntraNodeComm` for its ranks.
  - If the setup is not suitable for intra-node comm (e.g. not all ranks are from the same node), the rendezvous logic guarantees all participants fall back consistently.
- `c10d::ProcessGroupNCCL::allreduce` consults `c10d::IntraNodeComm` on whether to use intra-node allreduce and carries out the communication accordingly.

We currently detect two types of topologies from the NVLink connection mesh:
- Fully connected: all GPU pairs have a direct NVLink connection (e.g. NVSwitch or a fully connected subset of hybrid cube mesh)
  - `msg <= 256KB`: one-shot allreduce.
  - `256KB < msg <= 10MB`: two-shot allreduce.
  - `msg > 10MB`: instructs the caller to fall back to NCCL.
- Hybrid cube mesh
  - `msg <= 256KB`: one-shot allreduce.
  - `msg > 256KB`: instructs the caller to fall back to NCCL.

## Next Steps

- Fine tune algo selection based on GPU model, topology, link speed.
- Potentially optimize the two-shot allreduce impl. According to FasterTransformer, two-shot allreduce is preferred until 50MB. There might be room for improvement, but PyTorch does impose more constraints:
  - FasterTransformer uses a single process to drive multiple devices. It can use `cudaDeviceEnablePeerAccess` to enable device-level peer access.
  - PyTorch uses multiple processes to drive multiple devices. With cuda IPC, a device can only share a specific region with other devices. This means extra copies may be unavoidable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114001 Approved by: https://github.com/yf225	2023-12-15 08:17:35 +00:00
PyTorch MergeBot	7ecddaef23	Revert "Introduce 3 low-latency, intra-node allreduce algorithms for small messages to PyTorch (#114001 )" This reverts commit adfbd2b219f4995d3f13870927022b67550f8b0e. Reverted https://github.com/pytorch/pytorch/pull/114001 on behalf of https://github.com/atalman due to OSSCI oncall, breaks periodic jobs ([comment](https://github.com/pytorch/pytorch/pull/114001#issuecomment-1856539040))	2023-12-14 20:33:10 +00:00
Yifu Wang	adfbd2b219	Introduce 3 low-latency, intra-node allreduce algorithms for small messages to PyTorch (#114001 )

## Summary

This PR added 3 intra-node GPU allreduce algorithms to PyTorch:
- One-shot allreduce (inspired by FasterTransformer): all ranks simultaneously read and accumulate data from other ranks.
- Two-shot allreduce (inspired by FasterTransformer): all ranks simultaneously read and accumulate `1 / world_size` data from other ranks. Then all ranks read accumulated data from other ranks (effectively one-shot reduce-scatter + one-shot all-gather).
- Hybrid cube mesh allreduce (original): a one-shot allreduce variant that avoids transmission over PCIe on HCM topology.

## Micro Benchmarks

![image](https://github.com/pytorch/pytorch/assets/4156752/7bd25ffc-cd5b-4acb-bd65-b01bc136726e)
![image](https://github.com/pytorch/pytorch/assets/4156752/3ced31b4-6c31-4f34-a2d8-c072df29ae0e)
![image](https://github.com/pytorch/pytorch/assets/4156752/5b942c05-4fcc-4ec9-ae29-12c64080bb1c)

## Details

The intra-node algos are organized behind `c10d::IntraNodeComm`, which is responsible for:
- Managing handshaking and cuda IPC handle exchange among ranks.
- Querying NVLink connection and detecting topology.
- Performing algo selection based on available info.
- Launching the selected allreduce kernel.

`c10d::IntraNodeComm` is integrated into `c10d::ProcessGroupNCCL` as follows:
- When the `ENABLE_INTRA_NODE_COMM` environment variable is set, `c10d::ProcessGroupNCCL` initializes a `c10d::IntraNodeComm` for its ranks.
  - If the setup is not suitable for intra-node comm (e.g. not all ranks are from the same node), the rendezvous logic guarantees all participants fall back consistently.
- `c10d::ProcessGroupNCCL::allreduce` consults `c10d::IntraNodeComm` on whether to use intra-node allreduce and carries out the communication accordingly.

We currently detect two types of topologies from the NVLink connection mesh:
- Fully connected: all GPU pairs have a direct NVLink connection (e.g. NVSwitch or a fully connected subset of hybrid cube mesh)
  - `msg <= 256KB`: one-shot allreduce.
  - `256KB < msg <= 10MB`: two-shot allreduce.
  - `msg > 10MB`: instructs the caller to fall back to NCCL.
- Hybrid cube mesh
  - `msg <= 256KB`: one-shot allreduce.
  - `msg > 256KB`: instructs the caller to fall back to NCCL.

## Next Steps

- Fine tune algo selection based on GPU model, topology, link speed.
- Potentially optimize the two-shot allreduce impl. According to FasterTransformer, two-shot allreduce is preferred until 50MB. There might be room for improvement, but PyTorch does impose more constraints:
  - FasterTransformer uses a single process to drive multiple devices. It can use `cudaDeviceEnablePeerAccess` to enable device-level peer access.
  - PyTorch uses multiple processes to drive multiple devices. With cuda IPC, a device can only share a specific region with other devices. This means extra copies may be unavoidable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114001 Approved by: https://github.com/yf225	2023-12-14 08:13:08 +00:00
soulitzer	4d8ad4fb82	Move SingletonSymNodeImpl from c10 to aten (#114895 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/114895 Approved by: https://github.com/jbschlosser	2023-12-13 20:01:18 +00:00
Antonio Kim	7fc292930c	Add support for `torch.Generator` type in TorchScript (#110413 ) - Add support for `torch.Generator` type in TorchScript - Add `generator` args to all `torch.nn.init` functions that call `uniform_` or `normal_` - Add support for `torch.Generator` in LTC's TorchScript backend (CC: @wconstab) CC: @eellison @davidberard98 @GlebKazantaev @behzad-a Pull Request resolved: https://github.com/pytorch/pytorch/pull/110413 Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/glebk-cerebras, https://github.com/davidberard98	2023-11-21 23:07:21 +00:00
Mu-Chu Lee	eddce3c054	[AOTInductor] Rename model_runner to model_container_runner (#111324 ) Summary: We rename the model_runner to model_container_runner to prepare for adding tests of pure model without container. Test Plan: commit itself is a test. Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/111324 Approved by: https://github.com/desertfire, https://github.com/chenyang78	2023-11-16 19:14:22 +00:00
PyTorch MergeBot	252e68a83b	Revert "Add support for `torch.Generator` type in TorchScript (#110413 )" This reverts commit 54493fe8c4b1cca4c5ff993b99eb3e3dbc984226. Reverted https://github.com/pytorch/pytorch/pull/110413 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is, unfortunately, still breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/110413#issuecomment-1811625557))	2023-11-15 00:51:23 +00:00
Antonio Kim	54493fe8c4	Add support for `torch.Generator` type in TorchScript (#110413 ) - Add support for `torch.Generator` type in TorchScript - Add `generator` args to all `torch.nn.init` functions that call `uniform_` or `normal_` - Add support for `torch.Generator` in LTC's TorchScript backend (CC: @wconstab) CC: @eellison @davidberard98 @GlebKazantaev @behzad-a Pull Request resolved: https://github.com/pytorch/pytorch/pull/110413 Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/glebk-cerebras, https://github.com/davidberard98	2023-11-13 23:18:14 +00:00
PyTorch MergeBot	9a28a7b498	Revert "Add support for `torch.Generator` type in TorchScript (#110413 )" This reverts commit 27e31ab6e86259b27d816d6fb6e7a69de526a0e4. Reverted https://github.com/pytorch/pytorch/pull/110413 on behalf of https://github.com/PaliC due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/110413#issuecomment-1799003164))	2023-11-07 15:53:32 +00:00
soulitzer	d5007d8d8e	Split out input_metadata.cpp from input_metadata.h (#113031 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/113031 Approved by: https://github.com/albanD	2023-11-07 00:03:21 +00:00
Antonio Kim	27e31ab6e8	Add support for `torch.Generator` type in TorchScript (#110413 ) - Add support for `torch.Generator` type in TorchScript - Add `generator` args to all `torch.nn.init` functions that call `uniform_` or `normal_` - Add support for `torch.Generator` in LTC's TorchScript backend (CC: @wconstab) CC: @eellison @davidberard98 @GlebKazantaev @behzad-a Pull Request resolved: https://github.com/pytorch/pytorch/pull/110413 Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/glebk-cerebras, https://github.com/davidberard98	2023-11-06 21:27:02 +00:00
jjsjann123	9d23440c81	Nvfuser code base nuke (#111447 ) removing nvfuser code base. Pull Request resolved: https://github.com/pytorch/pytorch/pull/111447 Approved by: https://github.com/albanD	2023-11-01 20:53:14 +00:00
Yifu Wang	ec18ef62f4	Native c10d_functional ops (#110570 ) This PR introduces a native version of c10d_functional ops. The main goal is to add collective support in AOTInductor and allow collective ops to work in multi-threaded native runtimes. The native version also incorporates API improvements we wished to implement in Python c10d_functional: - Removed `ranks` and `group_size` from collective op signatures, which proved to be redundant. - Use tensor storage as opposed to `void*` to resolve in-flight work. The native process group registration/resolution mechanism is only used for native c10d_functional in this PR. It will become the single source of truth in upcoming PRs. The upcoming PRs will implement Inductor/AOTInductor support for c10d_functional, after which native c10d_functional will replace Python c10d_functional. Pull Request resolved: https://github.com/pytorch/pytorch/pull/110570 Approved by: https://github.com/wanchaol	2023-10-25 22:56:06 +00:00
PyTorch MergeBot	5ce8002d24	Revert "Remove deprecated fbgemm operators (#104535 )" This reverts commit 57c7aa12dbf71617bd21fe7e076df8e823b5b7bb. Reverted https://github.com/pytorch/pytorch/pull/104535 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/104535#issuecomment-1779650412))	2023-10-25 16:34:16 +00:00
jjsjann123	39c09d4da6	Revert "Revert "Nvfuser code removal (#111093 )"" (#111604 ) This reverts commit 715dfced72657e5adacd5bef16e3d458cd94851b. The original PR #111093 is reverted due to broken internal build. Pull Request resolved: https://github.com/pytorch/pytorch/pull/111604 Approved by: https://github.com/davidberard98	2023-10-23 18:32:41 +00:00
Peter Bell	57c7aa12db	Remove deprecated fbgemm operators (#104535 ) These operators are not used and have been deprecated since #72690 (Feb 2022). Additionally, the `torch.jit.quantized` interface has been deprecated since #40102 (June 2020). Pull Request resolved: https://github.com/pytorch/pytorch/pull/104535 Approved by: https://github.com/ezyang	2023-10-22 06:10:09 +00:00
PyTorch MergeBot	715dfced72	Revert "Nvfuser code removal (#111093 )" This reverts commit 572628e52054b0e061fbaeb0497267380fe45180. Reverted https://github.com/pytorch/pytorch/pull/111093 on behalf of https://github.com/jeanschmidt due to Breaking internal builds, @albanD please help to support the author with the next steps to get this diff merged ([comment](https://github.com/pytorch/pytorch/pull/111093#issuecomment-1771434853))	2023-10-19 17:39:49 +00:00
jjsjann123	572628e520	Nvfuser code removal (#111093 ) Removes the existing integration code & build of nvfuser in TorchScript. Note that I intentionally left the part where we wipe out `third_party/nvfuser` repo. I'll do that in a separate PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/111093 Approved by: https://github.com/albanD	2023-10-18 01:00:47 +00:00
PyTorch MergeBot	1e70f4d02c	Revert "Reland #2 "[C10] PG observability hooks. (#108815 , #110907 )" (#111072 )" This reverts commit bb1424d46e656dfcdd4c12efe58ada9f1720c4d8. Reverted https://github.com/pytorch/pytorch/pull/111072 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/111072#issuecomment-1765399829))	2023-10-16 23:03:26 +00:00
PyTorch MergeBot	f68d6e8108	Revert "Move at::{Refcounted,}MapAllocator to c10 (#109881 )" This reverts commit 68a1219f74467a4d2124288f3ab6f8bc471fe4a1. Reverted https://github.com/pytorch/pytorch/pull/109881 on behalf of https://github.com/kit1980 due to breaking internal builds, undefined symbol: _ZN3c1022RefcountedMapAllocator6decrefEv ([comment](https://github.com/pytorch/pytorch/pull/109881#issuecomment-1761950014))	2023-10-13 17:57:53 +00:00
Will Constable	bb1424d46e	Reland #2 "[C10] PG observability hooks. (#108815 , #110907 )" (#111072 ) This reverts commit 314a502eb04c6382e2cc9af0573533efba54109d. Changes since original PR: Reland 1 * rename torch.distributed.hooks to torch.distributed._hooks Reland 2 * make _hooks importable even if !distributed.is_available() * handle cuda driver exit intermittent failure caused by new cuda api usage in callback caller (see prev PR in stack) (original PR https://github.com/pytorch/pytorch/pull/108815 desc copied below) Expose a set of observability hooks into C10D such that our users can detect collective failures both faster and more easily. The design is similar to NCCL desync debug in that it minimizes the overhead by doing most of the work off the main thread. This PR introduces a new module torch.distributed.hooks that exposes the following set of methods: register_collective_start_hook, register_collective_end_hook, register_process_group_hook. The process group hook exposes PG creation on the member ranks and calls them inline from the PG creation code. This is fine since this happens during initialization and a limited number of times. The collective start/end hooks are fired from a single background thread. It reads events from a C++ queue and dispatches them. Queue notification is oddly done using a pipe; this is needed so python can abort the thread on shutdown and have it as a background thread. This is not possible with more reasonable choices like a condvar. Pull Request resolved: https://github.com/pytorch/pytorch/pull/111072 Approved by: https://github.com/malfet ghstack dependencies: #111061	2023-10-12 16:59:23 +00:00
Peter Bell	68a1219f74	Move at::{Refcounted,}MapAllocator to c10 (#109881 ) `libshm.so` depends on the torch library exclusively for `at::RefcountedMapAllocator`, so it makes sense to move it to c10 along with the other memory allocators. This means `libshm.so` only depends on `c10` and we don't need to relink `libshm.so` for every ATen change. Pull Request resolved: https://github.com/pytorch/pytorch/pull/109881 Approved by: https://github.com/albanD	2023-10-12 10:51:13 +00:00
Bin Bao	3058700f7f	[aotinductor] Add AOTIModelRunner as a utility class (#110891 ) Summary: Introduce a utility class AOTIModelRunner to take care of running an AOTInductor compiled model. It does things like dlopen a model, initialize the model container, setup inputs and outputs, and destroy the model container. Pull Request resolved: https://github.com/pytorch/pytorch/pull/110891 Approved by: https://github.com/chenyang78 ghstack dependencies: #110652	2023-10-11 15:58:28 +00:00
PyTorch MergeBot	314a502eb0	Revert "Reland "[C10] PG observability hooks. (#108815 )" (#110907 )" This reverts commit 7678cd22af46c9df4fb47a409d3e8ad71a6127ea. Reverted https://github.com/pytorch/pytorch/pull/110907 on behalf of https://github.com/huydhn due to Sorry for reverting this, but macos job in trunk starts failing after this `7678cd22af` ([comment](https://github.com/pytorch/pytorch/pull/110907#issuecomment-1756497387))	2023-10-11 00:23:42 +00:00
PyTorch MergeBot	02a02a23ee	Revert "Move at::{Refcounted,}MapAllocator to c10 (#109881 )" This reverts commit 0341deb1c720d8c908ed40e853eaacfc8ac37181. Reverted https://github.com/pytorch/pytorch/pull/109881 on behalf of https://github.com/albanD due to It does break buck build ([comment](https://github.com/pytorch/pytorch/pull/109881#issuecomment-1756195823))	2023-10-10 20:39:12 +00:00
Will Constable	7678cd22af	Reland "[C10] PG observability hooks. (#108815 )" (#110907 ) This reverts commit ff0358b0384d6a3a5b8ceeae625c93221612ba8e. (original PR https://github.com/pytorch/pytorch/pull/108815 desc copied below) Expose a set of observability hooks into C10D such that our users can detect collectives failure both faster and more easily. The design is similar to NCCL desync debug that it minimized the overhead by doing most of the work out of the main thread. This PR introduces a new module torch.distributed.hooks that exposes the following set of methods: register_collective_start_hook register_collective_end_hook register_process_group_hook The process group hook exposes PG creation on the member ranks and calls them inline from the PG creation code. This is fine since this happens during initialization and a limited number of times. The collective start/end hooks are fired from a single background thread. It reads events from a C++ queue and dispatches over. Queue notification is oddly done using a pipe, this is needed so python can abort the thread on shutdown and have it as background thread. This is not possible with more reasonable choices like a condvar. Pull Request resolved: https://github.com/pytorch/pytorch/pull/110907 Approved by: https://github.com/fduwjj	2023-10-10 20:09:40 +00:00
Peter Bell	0341deb1c7	Move at::{Refcounted,}MapAllocator to c10 (#109881 ) `libshm.so` depends on the torch library exclusively for `at::RefcountedMapAllocator`, so it makes sense to move it to c10 along with the other memory allocators. This means `libshm.so` only depends on `c10` and we don't need to relink `libshm.so` for every ATen change. Pull Request resolved: https://github.com/pytorch/pytorch/pull/109881 Approved by: https://github.com/albanD	2023-10-09 23:53:47 +00:00
George White	f4796df914	Add support for generators on the IPU device (#110704 ) This change adds hooks similar to those used on other device types, to allow the Torch to create and use generators provided by the IPU backend. Pull Request resolved: https://github.com/pytorch/pytorch/pull/110704 Approved by: https://github.com/ezyang	2023-10-06 21:36:14 +00:00
PyTorch MergeBot	ff0358b038	Revert "[C10] PG observability hooks. (#108815 )" This reverts commit 0c7a877745f98b8fce8868291408945c0dd817d6. Reverted https://github.com/pytorch/pytorch/pull/108815 on behalf of https://github.com/albanD due to Add a new torch.distributed.hooks namespace but does not document it, test was added this morning ([comment](https://github.com/pytorch/pytorch/pull/108815#issuecomment-1751327751))	2023-10-06 19:49:49 +00:00
Rodrigo Kumpera	0c7a877745	[C10] PG observability hooks. (#108815 ) Expose a set of observability hooks into C10D such that our users can detect collectives failure both faster and more easily. The design is similar to NCCL desync debug that it minimized the overhead by doing most of the work out of the main thread. This PR introduces a new module torch.distributed.hooks that exposes the following set of methods: register_collective_start_hook register_collective_end_hook register_process_group_hook The process group hook exposes PG creation on the member ranks and calls them inline from the PG creation code. This is fine since this happens during initialization and a limited number of times. The collective start/end hooks are fired from a single background thread. It reads events from a C++ queue and dispatches over. Queue notification is oddly done using a pipe, this is needed so python can abort the thread on shutdown and have it as background thread. This is not possible with more reasonable choices like a condvar. Pull Request resolved: https://github.com/pytorch/pytorch/pull/108815 Approved by: https://github.com/wconstab, https://github.com/fduwjj	2023-10-06 18:52:46 +00:00
zdevito	3fe3439242	Use LLVMSymbolizer directly for unwind inside fbcode (#108800 ) Using LLVMSymbolizer directly avoids having to call fork which has caused timeouts in some circumstances. Differential Revision: [D49070589](https://our.internmc.facebook.com/intern/diff/D49070589/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/108800 Approved by: https://github.com/aaronenyeshi	2023-10-04 04:04:08 +00:00
cyy	7f5fd92372	Reland use std::make_unique after internal changes (#109742 ) check internal follow up of #109780 Pull Request resolved: https://github.com/pytorch/pytorch/pull/109742 Approved by: https://github.com/ezyang	2023-09-28 17:24:08 +00:00
rzou	8124a6c40c	[TORCH_LIBRARY] Add impl_abstract_pystub (#109529 ) We want users to be able to define custom ops in C++ but put the abstract impl in Python (since it is easier to write them in Python and the abstract impl better models device semantics and data-dependent operators). `m.impl_abstract_pystub(opname, python_module, context)` declares the abstract_impl of the operator to exist in the given python module. When the abstract_impl needs to be accessed (either via FakeTensor or Meta), and it does not exist, the PyTorch Dispatcher will yell with a descriptive error message. Some details: - We construct a new global AbstractImplPyStub mapping in Dispatcher.cpp. Read/write to this map is protected by the Dispatcher lock. - We add a new Meta Tensor fallback kernel. The fallback errors out if there is no meta kernel, but also offers a nicer error message if we see that there is a pystub. - We create a `torch._utils_internal.throw_abstract_impl_not_imported_error` helper function to throw errors. This way, we can throw different error messages in OSS PyTorch vs internal PyTorch. To invoke this from C++, we added a PyInterpreter::throw_abstract_impl_not_imported_error. Differential Revision: [D49464753](https://our.internmc.facebook.com/intern/diff/D49464753/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/109529 Approved by: https://github.com/ezyang, https://github.com/bdhirsh	2023-09-22 04:55:36 +00:00
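
The PG observability hooks entries above describe collective start/end hooks fired from a single background thread that reads events from a queue and is notified through a pipe, so that Python can stop the thread on shutdown. A minimal Python sketch of that general pattern follows. It is an illustration only, not the code from #108815: every name in it (`PipeNotifiedDispatcher`, `register_hook`, `publish`, `shutdown`) is hypothetical, and it uses a Python `queue.Queue` where the PR describes a C++ queue.

```python
import os
import queue
import threading


class PipeNotifiedDispatcher:
    """Hypothetical illustration: one background thread drains an event
    queue and is woken up (or told to stop) through an OS pipe."""

    def __init__(self):
        self._events = queue.Queue()
        self._hooks = []
        # The reader thread blocks on the read end of the pipe;
        # producers write one byte to the write end per notification.
        self._read_fd, self._write_fd = os.pipe()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def register_hook(self, hook):
        """Register a callable invoked with each dispatched event."""
        self._hooks.append(hook)

    def publish(self, event):
        """Queue an event, then wake the background thread."""
        self._events.put(event)
        os.write(self._write_fd, b"e")

    def shutdown(self):
        """Ask the thread to stop; events published earlier are still
        dispatched because pipe bytes are read in FIFO order."""
        os.write(self._write_fd, b"q")
        self._thread.join()
        os.close(self._read_fd)
        os.close(self._write_fd)

    def _run(self):
        while True:
            byte = os.read(self._read_fd, 1)  # blocks until notified
            if byte != b"e":  # b"q" (stop request) or b"" (pipe closed)
                return
            event = self._events.get()
            for hook in list(self._hooks):
                hook(event)


dispatcher = PipeNotifiedDispatcher()
seen = []
dispatcher.register_hook(seen.append)
dispatcher.publish("collective_start")
dispatcher.publish("collective_end")
dispatcher.shutdown()
```

Because the stop request travels through the same pipe as the wake-ups, the thread never needs to be interrupted while blocked, which is the property the commit message attributes to using a pipe rather than a condition variable.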

428 Commits