pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-11-16 07:24:54 +08:00

Author	SHA1	Message	Date
Will Constable	3b19956cb2	[c10d] Add assertRaisesRegexOnRank helper for distributed. Allow asserting that an exception is raised only on the specified rank, but not on other ranks. Especially useful for pipeline parallelism. ghstack-source-id: 7a27f9e128f465e52e503617914261af4dbbbb41 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126731	2024-05-20 17:00:26 -07:00
Ke Wen	04bf7713e8	[c10d] Reduce test time by reusing ProcessGroup (#125648 ) ## Problem this PR resolves Today, most of distributed tests are arranged like this: ``` def test_allreduce(self): pg = self._create_process_group_nccl(store, self.opts()) pg.allreduce(tensor) ... ``` Thus, we are paying PG creation time per test. That's bad. But why were we doing that? Is there a constraint? If we look deeper, we would find that most of our test cases inherit from `torch.testing._internal.common_distributed.MultiProcessTestCase`. From the name, nothing seems wrong, and probably fits distributed well. But a "problem" exists in its `setUp()` and `tearDown()` methods, which basically do the following: ``` def setUp(self): self._spawn_processes() def tearDown(self): for p in self.processes: p.terminate() ``` Since `setUp` and `tearDown` are "test-scope fixtures", meaning, they are called per test, each test will have brand new processes. Of course we'd have to recreate ProcessGroup every time. ## How we are fixing it First, obviously, we need to put a PG's lifetime into a longer scope. Python `unittest` provides such a helper, called "class-scope fixtures." It is embodied by a `setUpClass` method and a `tearDownClass` method (note the name difference), which are called only once for all tests in the same test class. Therefore, we would do: ``` @classmethod def setUpClass(self): dist.init_process_group(...) @classmethod def tearDownClass(self): dist.destroy_process_group() ``` In this PR, we create a new test template for distributed: `MultiProcContinousTest`, to hold this class-scope fixture. Second, we'd need to avoid per-test process spawn and terminate. That's easy, we can either: 1. launch the whole test file with `torchrun --nproc-per-node=...` or 2. use `mp.spawn()` under `if __name__ == "__main__":`. Point is, launch the processes only once. ## Result We moved the "positive tests" from test_c10d_nccl.py to test_c10d_ops_nccl.py. 
Before this PR: ``` $ python test_c10d_nccl.py -k ProcessGroupNCCLTest Ran 24 tests in 174.457s ``` After this PR: ``` $ torchrun --nproc-per-node 2 test_c10d_ops_nccl.py or $ python test_c10d_ops_nccl.py Ran 24 tests in 16.247s ``` 10X speedup. ## Limitation For tests intended to test destroy or abort of PGs, we'd need to go back to the old style. So it would make sense to divide our tests into two classes: one for positive tests where we would reuse the PGs, and the other one for abort/destroy and negative tests like watchdog timeout. ## Next step Migrate the tests of distributed that would fit with this test style! Pull Request resolved: https://github.com/pytorch/pytorch/pull/125648 Approved by: https://github.com/wconstab	2024-05-08 22:33:40 +00:00
Aaron Gokaslan	5a1216bb2e	[BE]: Update ruff to 0.4.1 (#124549 ) Update ruff to 0.4.1. This version fixes a lot of false negatives/false positives, is 20-40% faster, and has various other bug fixes. Below is a before and after table showing the execution time of ruff lint and ruff format in milliseconds, courtesy of https://astral.sh/blog/ruff-v0.4.0 \| Repository \| Linter (v0.3) \| Linter (v0.4) \| Formatter (v0.3) \| Formatter (v0.4) \| \|----------------------------------------------------\|---------------\|---------------\|------------------\|------------------\| \| [pytorch/pytorch](https://github.com/pytorch/pytorch) \| 328.7 \| 251.8 \| 351.1 \| 274.9 \| Pull Request resolved: https://github.com/pytorch/pytorch/pull/124549 Approved by: https://github.com/ezyang	2024-04-21 14:06:23 +00:00
Yifu Wang	2a2e1d8e4f	[functional collective] change the Python APIs to only use the native funcol ops (#123777 ) ## Summary After this PR, the functional collective Python APIs will stop honoring `TORCH_DISABLE_NATIVE_FUNCOL` and only use native funcol ops. Specifically, this PR: - Removed `use_native_funcol()`. - Removed the code path in the Python APIs when `use_native_funcol()` is `False`. - Changed the CI tests that run on both native funcol and legacy funcol through the Python API to only run with native funcol. ## Test Changes `test_functional_api.py` - Removed the tests where only one of output_split_sizes or input_split_sizes is specified. This behavior is unreliable and has been removed from the native funcol. - Removed `TestWaitiness` which tests an implementation detail of the legacy funcol. We have equivalent tests for native funcol in `test/distributed/test_c10d_functional_native.py` `b7fac76fc2/test/distributed/test_c10d_functional_native.py (L114-L116)` `test/distributed/_tensor/test_dtensor.py` `test/distributed/_tensor/test_dtensor_compile.py` `test/distributed/test_device_mesh.py` `test/distributed/_tensor/experimental/test_tp_transform.py` `test/distributed/_tensor/test_matrix_ops.py` `test/distributed/test_inductor_collectives.py` - All these tests were double running with both native funcol and legacy funcol. Changed to only run with native funcol. `test/distributed/test_c10d_functional_native.py` - Removed the `run_with_native_funcol` decorators. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123777 Approved by: https://github.com/wanchaol ghstack dependencies: #123776	2024-04-13 03:08:36 +00:00
arunppsg	b78e8c0d37	remove duplicate method run_subtests (#122421 ) Fixes #121654 I have removed the duplicate test `run_subtests` from `common_dtensor.py` and `common_fsdp.py` and moved it to `common_distributed.py`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122421 Approved by: https://github.com/soulitzer	2024-03-22 07:00:49 +00:00
Yifu Wang	22cd2658b4	Disable GroupRegistry's thread isolation by default (#121457 ) Today `GroupRegistry` employs thread isolation by default, i.e. every thread sees its own process group registry. This is intended to work for the one-device-per-process case (for Python use cases) and the one-device-per-thread case (for custom native runtimes). However, there's a problem: there are Python use cases that initialize/register process groups in one thread and run collectives in another thread. This use case should be supported. However, since `GroupRegistry` employs thread isolation by default, collectives in different threads can't find the registered process groups. This PR fixes the issue by: - Making `GroupRegistry` work in non-thread isolation mode by default. This would match the behavior w/o the native process group registry. - Introducing `set_thread_isolation_mode` so one-device-per-thread runtimes can enable thread isolation mode explicitly. Differential Revision: [D54658515](https://our.internmc.facebook.com/intern/diff/D54658515) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121457 Approved by: https://github.com/wanchaol	2024-03-08 19:31:24 +00:00
Yifu Wang	fad228c7cc	Fix a potential race condition in the test decorators for enabling/disabling native funcol (#120833 ) Previously, we parametrized some tests to run with both native and py funcol by flipping a global variable. However, some of these tests are multi-threaded tests, and the parametrization mechanism could lead to a race condition. This PR changes the mechanism to use `mock.patch` which is applied on a per-thread basis. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120833 Approved by: https://github.com/wconstab	2024-02-29 03:19:44 +00:00
Yifu Wang	637cf4a3f2	Test parametrization utils for native funcol migration (#119950 ) ``` Between the time we switch to the native funcol by default and the time when we are confident that we can remove the legacy implementation, we want to ensure that the legacy funcol remains covered by unit tests. This is to prepare for any potential (but unlikely) reverts. The following utilities help achieve this goal. run_with_{native,legacy}_funcol - mark a test to run with only {native,legacy} funcol. These decorators are for impl specific tests (e.g. verifying generated code with FileCheck). run_with_both_funcol_impls - parametrize a test to run with both legacy and native funcol. run_with_both_funcol_impls_with_arg - same as run_with_both_funcol_impls, but passes `enable_native_funcol` to the test so impl specific checks can be carried out. ``` This PR also marks some tests we want to cover in this fashion. More tests will be marked in subsequent PRs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119950 Approved by: https://github.com/wanchaol ghstack dependencies: #119881	2024-02-19 02:46:03 +00:00
Edward Z. Yang	9bce208dfb	Replace follow_imports = silent with normal (#118414 ) This is a lot of files changed! Don't panic! Here's how it works: * Previously, we set `follow_imports = silent` for our mypy.ini configuration. Per https://mypy.readthedocs.io/en/stable/running_mypy.html#follow-imports, what this does is whenever we have an import to a module which is not listed as a file to be typechecked in mypy, we typecheck it as normal but suppress all errors that occurred in that file. * When mypy is run inside lintrunner, the list of files is precisely the files covered by the glob in lintrunner.toml, but with files in excludes excluded. * The top-level directive `# mypy: ignore-errors` instructs mypy to typecheck the file as normal, but ignore all errors. * Therefore, it should be equivalent to set `follow_imports = normal`, if we put `# mypy: ignore-errors` on all files that were previously excluded from the file list. * Having done this, we can remove the exclude list from .lintrunner.toml, since excluding a file from typechecking is baked into the files themselves. * torch/_dynamo and torch/_inductor were previously in the exclude list, because they were covered by MYPYINDUCTOR. It is not OK to mark these as `# mypy: ignore-errors` as this will impede typechecking on the alternate configuration. So they are temporarily being checked twice, but I am suppressing the errors in these files as the configurations are not quite the same. I plan to unify the configurations so this is only a temporary state. * There were some straggler type errors after these changes somehow, so I fixed them as needed. There weren't that many. In the future, to start type checking a file, just remove the ignore-errors directive from the top of the file. The codemod was done with this script authored by GPT-4: ``` import glob exclude_patterns = [ ... 
] for pattern in exclude_patterns: for filepath in glob.glob(pattern, recursive=True): if filepath.endswith('.py'): with open(filepath, 'r+') as f: content = f.read() f.seek(0, 0) f.write('# mypy: ignore-errors\n\n' + content) ``` Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/118414 Approved by: https://github.com/thiagocrepaldi, https://github.com/albanD	2024-01-27 02:44:11 +00:00
Aaron Gokaslan	6de28e92d2	[BE]: Apply FURB118 (prev): replaces unnecessary lambdas with operator. (#116027 ) This replaces a bunch of unnecessary lambdas with the operator package. This is semantically equivalent, but the operator package is faster, and arguably more readable. When the FURB rules are taken out of preview, I will enable it as a ruff check. Pull Request resolved: https://github.com/pytorch/pytorch/pull/116027 Approved by: https://github.com/malfet	2023-12-20 19:35:08 +00:00
Chip Turner	9cc040fef6	Switch env variable use in test harnesses to the non-deprecated names to fix warnings (#114880 ) Previously: ``` [W Utils.hpp:133] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt) [W Utils.hpp:133] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt) ``` With this PR, those warnings disappear. They were introduced in #114077 This change was generated with this sed script, applied with `sed -i -f /tmp/x */.{py,hpp,cpp,cc}` and hand inspected. ``` s/\bNCCL_BLOCKING_WAIT\b/TORCH_NCCL_BLOCKING_WAIT/g s/\bNCCL_ENABLE_TIMING\b/TORCH_NCCL_ENABLE_TIMING/g s/\bNCCL_DESYNC_DEBUG\b/TORCH_NCCL_DESYNC_DEBUG/g s/\bNCCL_ASYNC_ERROR_HANDLING\b/TORCH_NCCL_ASYNC_ERROR_HANDLING/g s/\bENABLE_NCCL_HEALTH_CHECK\b/TORCH_ENABLE_NCCL_HEALTH_CHECK/g s/\bNCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK\b/TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK/g ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/114880 Approved by: https://github.com/kwen2501	2023-12-01 20:08:23 +00:00
PyTorch MergeBot	2c4930a91d	Revert "[fx/DDP] add nested ctx_manager test for DDP Dynamo (#114056 )" This reverts commit d5d62e85615fdf345e0556a9d8edbee2d3c64ae2. Reverted https://github.com/pytorch/pytorch/pull/114056 on behalf of https://github.com/malfet due to Breaks inductor_distributed, see `d5d62e8561` ([comment](https://github.com/pytorch/pytorch/pull/114056#issuecomment-1822006423))	2023-11-22 02:52:31 +00:00
Jon Chuang	d5d62e8561	[fx/DDP] add nested ctx_manager test for DDP Dynamo (#114056 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/114056 Approved by: https://github.com/wconstab	2023-11-22 01:08:25 +00:00
Ke Wen	dc65f6c601	[c10d] Remove deprecated multi-gpu-per-thread APIs (#114156 ) As of today, PyTorch Distributed's preferred programming model is one device per thread, as exemplified by the APIs in its documentation. The multi-GPU functions (which stand for multiple GPUs per CPU thread) have been deprecated for three versions. Removing them now before the 2.2 release. Pull Request resolved: https://github.com/pytorch/pytorch/pull/114156 Approved by: https://github.com/albanD, https://github.com/fduwjj, https://github.com/H-Huang	2023-11-21 03:50:23 +00:00
Aaron Gokaslan	18d7b8e4f7	[BE]: ruff apply rule PLW1510 to find silent subprocess errors (#113644 ) Reopens #111682 that I messed up due to a bad rebase and triggered some issues with CLA. This explicitly adds check=True or False to any subprocess calls where appropriate. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113644 Approved by: https://github.com/ezyang, https://github.com/kit1980	2023-11-14 20:59:40 +00:00
Rodrigo Kumpera	fe2cda64dc	[C10D] Implement new libuv backend for TCPStore. (#108066 ) The new backend is currently under a flag 'use_libuv' in TCPStore constructor to reduce the impact on existing users as we test it. This is a reland of #105870 with a fix for a bad test. Differential Revision: [D48742554](https://our.internmc.facebook.com/intern/diff/D48742554) Pull Request resolved: https://github.com/pytorch/pytorch/pull/108066 Approved by: https://github.com/H-Huang, https://github.com/fduwjj	2023-08-29 14:55:14 +00:00
PyTorch MergeBot	d3f92ca9e9	Revert "[C10D] Implement new libuv backend for TCPStore. (#105870 )" This reverts commit 3c841163cef9167ea50adbcfc4384b63c0b6e93a. Reverted https://github.com/pytorch/pytorch/pull/105870 on behalf of https://github.com/huydhn due to I think the distributed failure is related as this is now failing in trunk ([comment](https://github.com/pytorch/pytorch/pull/105870#issuecomment-1683117192))	2023-08-17 23:41:00 +00:00
Rodrigo Kumpera	3c841163ce	[C10D] Implement new libuv backend for TCPStore. (#105870 ) The new backend is currently under a flag 'use_libuv' in TCPStore constructor to reduce the impact on existing users as we test it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/105870 Approved by: https://github.com/H-Huang	2023-08-17 20:40:32 +00:00
Rohan Varma	4137d6e499	[Composable FSDP] Enable HSDP (#105206 ) Need to pass in strategy to _init_process_group_state to enable hsdp for composable. Differential Revision: [D47462394](https://our.internmc.facebook.com/intern/diff/D47462394/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/105206 Approved by: https://github.com/awgu, https://github.com/fegin	2023-07-26 21:03:55 +00:00
Aaron Gokaslan	6d43c89f37	[BE]: Update Ruff to 0.0.280 (#105724 ) Removes unused loop values in Python dictionary iteration. Automated fix from Ruff master. Pull Request resolved: https://github.com/pytorch/pytorch/pull/105724 Approved by: https://github.com/ezyang, https://github.com/janeyx99	2023-07-22 23:03:34 +00:00
Justin Chu	4cc1745b13	[BE] f-stringify torch/ and scripts (#105538 ) This PR is a follow up on the pyupgrade series to convert more strings to use f-strings using `flynt`. - https://docs.python.org/3/reference/lexical_analysis.html#f-strings - https://pypi.org/project/flynt/ Command used: ``` flynt torch/ -ll 120 flynt scripts/ -ll 120 flynt tools/ -ll 120 ``` and excluded `collect_env.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/105538 Approved by: https://github.com/ezyang, https://github.com/malfet	2023-07-21 19:35:24 +00:00
Justin Chu	be03a56955	[BE] Enable ruff's UP rules and autoformat testing/ (#105425 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/105425 Approved by: https://github.com/malfet	2023-07-18 21:04:39 +00:00
Iris	15eed5b73e	[Oncall][MTPG] Fix flaky test multi_threaded - test_broadcast_object_list (#103568 ) This test (`8340762211/test/distributed/test_multi_threaded_pg.py (L133)`) is failing on the internal sandbox with the following error message: ``` File "/data/sandcastle/boxes/eden-trunk-hg-fbcode-fbsource/buck-out/v2/gen/fbcode/8c7462494077df89/caffe2/test/distributed/__multi_threaded__/multi_threaded#link-tree/torch/testing/_internal/distributed/multi_threaded_pg.py", line 255, in _start_coll raise Exception( Exception: world not ready, only 3 PG's registered but world has 4 ranks exiting thread 1 ERROR ``` Internal error report: https://www.internalfb.com/intern/test/562950031915334?ref_report_id=0 We believe this is because we no longer perform a barrier after init (see https://github.com/pytorch/pytorch/pull/99937). This PR temporarily turns ```TORCH_DIST_INIT_BARRIER``` back on to avoid the flaky test for the time being, but we should look into a way to do this properly. cc. @kumpera @kwen2501 Pull Request resolved: https://github.com/pytorch/pytorch/pull/103568 Approved by: https://github.com/H-Huang	2023-06-18 07:05:28 +00:00
Aaron Gokaslan	e2a3817dfd	[BE] Enable C419 rule for any all shortcircuiting (#99890 ) Apparently https://github.com/pytorch/pytorch/pull/78142 made torch.JIT allow for simple generator expressions which allows us to enable rules that replace unnecessary list comprehensions with generators in any/all. This was originally part of #99280 but I split it off into this PR so that it can be easily reverted should anything break. Pull Request resolved: https://github.com/pytorch/pytorch/pull/99890 Approved by: https://github.com/justinchuby, https://github.com/kit1980, https://github.com/malfet	2023-04-25 15:02:13 +00:00
Edward Z. Yang	5a7aad9681	Convert logging f-strings to use % format, part four (#98705 ) This does multi-line concatenated string literals. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/98705 Approved by: https://github.com/voznesenskym	2023-04-11 13:17:59 +00:00
Edward Z. Yang	b09722f540	Convert logging f-strings to use % format, part two (#98700 ) This hits multi-line logging strings Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/98700 Approved by: https://github.com/voznesenskym	2023-04-10 12:19:31 +00:00
Edward Z. Yang	9a8f71f23e	Convert logging f-strings to use % format (#98697 ) Codemod done with https://gist.github.com/ezyang/2e8b0463cdc6be278478495b23ff0530 with assistance from ChatGPT. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/98697 Approved by: https://github.com/voznesenskym	2023-04-10 12:19:31 +00:00
Horace He	5bbec680d7	Fix usages of contextmanager without finally (#96170 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/96170 Approved by: https://github.com/ngimel, https://github.com/malfet	2023-03-08 20:59:27 +00:00
Sergii Dymchenko	35bf5bac26	Fix "sandcastle_skip_if decorator name is confusing" (#95649 ) Fixes https://github.com/pytorch/pytorch/issues/89473 See the issue https://github.com/pytorch/pytorch/issues/89473 Pull Request resolved: https://github.com/pytorch/pytorch/pull/95649 Approved by: https://github.com/atalman, https://github.com/malfet	2023-03-03 09:29:40 +00:00
fduwjj	a88bfc60c7	[2/N][ST deprecate][BE] Remove Replicate Tensor convert from DDP and PTD (#95450 ) No use is found for this ST/Replicated Tensor based DDP. As part of ShardedTensor migration, let's remove this logic. Trying to undo everything in https://github.com/pytorch/pytorch/pull/75753. Pull Request resolved: https://github.com/pytorch/pytorch/pull/95450 Approved by: https://github.com/wanchaol	2023-02-26 03:03:37 +00:00
Howard Huang	5c7f4534e9	[small] multithreaded-pg guard attr (#93883 ) Currently the test ``` pytest test/distributed/test_multi_threaded_pg.py -vs ``` has errors ``` Traceback (most recent call last): File "/private/home/howardhuang/.conda/envs/pytorch/lib/python3.9/threading.py", line 980, in _bootstrap_inner self.run() File "/private/home/howardhuang/.conda/envs/pytorch/lib/python3.9/threading.py", line 917, in run self._target(*self._args, **self._kwargs) File "/private/home/howardhuang/pytorch-projects/pytorch/torch/testing/_internal/common_distributed.py", line 1029, in _run self._tls.precision = TestCase._precision AttributeError: 'TestCollectivesWithBaseClass' object has no attribute '_tls' ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/93883 Approved by: https://github.com/awgu, https://github.com/wanchaol	2023-02-03 23:01:02 +00:00
Xilun Wu	966030f7c7	[DTensor][fix] MultiThreadedTestCase misses _tls object and it won't reflect in CI (#93832 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/93832 Approved by: https://github.com/wanchaol	2023-02-02 07:56:44 +00:00
Will Constable	ac791bddce	Refactor dynamo distributed test helpers to be reusable (#93187 ) The point is to let Test helpers previously defined and used in `test_dynamo_distributed.py` be used from a new file `test_traceable_collectives.py` later in this stack. Pull Request resolved: https://github.com/pytorch/pytorch/pull/93187 Approved by: https://github.com/kumpera	2023-02-01 06:09:42 +00:00
Wanchao Liang	06d54b4061	[threaded_pg] fix the comments of MultiThreadTestCase (#92373 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/92373 Approved by: https://github.com/wz337	2023-01-19 03:42:54 +00:00
Wanchao Liang	801d831d7a	[dtensor] enable op db tests by using multithreaded test case (#92198 ) The time difference between MultiThreadedTestCase and MultiProcessTestCase on the op db tests is dramatic. Using MultiThreadedTestCase on an AWS dev node: ``` time pytest test/distributed/_tensor/test_dtensor_ops.py ============= 175 passed, 42 skipped, 397 xfailed in 80.30s (0:01:20) ======= real 1m22.330s user 1m38.782s sys 0m18.762s ``` MultiProcessTestCase takes from 40 minutes to more than 1 hour, even when using pytest parallel testing tools. Pull Request resolved: https://github.com/pytorch/pytorch/pull/92198 Approved by: https://github.com/XilunWu	2023-01-17 03:26:38 +00:00
Wanchao Liang	e16979c9a0	[threaded_pg] full rewrite of MultiThreadedTestCase to enable device_type tests (#91650 ) This PR is a full rewrite of MultiThreadedTestCase to make it more aligned with MultiProcessTestCase. It also changes how spawning and testing are done, so that we can embed thread-local state when running tests. This PR enables device_type tests to work with MultiThreadedTestCase. Pull Request resolved: https://github.com/pytorch/pytorch/pull/91650 Approved by: https://github.com/XilunWu	2023-01-17 03:26:36 +00:00
Xilun Wu	a6dcebf997	[threaded pg] make exception handling consistent with MultiProcessTestCase (#90712 ) Differential Revision: [D42153661](https://our.internmc.facebook.com/intern/diff/D42153661) Pull Request resolved: https://github.com/pytorch/pytorch/pull/90712 Approved by: https://github.com/wanchaol	2022-12-20 23:37:40 +00:00
Xilun Wu	34da446072	[threaded pg] add assertion util to MultiThreadedTestCase (#90595 ) Differential Revision: [D42153662](https://our.internmc.facebook.com/intern/diff/D42153662) Pull Request resolved: https://github.com/pytorch/pytorch/pull/90595 Approved by: https://github.com/wanchaol	2022-12-20 23:37:40 +00:00
Shen Li	80542add73	[FSDP] Allow MixedPrecision to skip inputs (#90620 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/90620 Approved by: https://github.com/rohan-varma, https://github.com/awgu	2022-12-11 06:39:38 +00:00
Shen Li	082450609c	[FSDP] Allow nested FSDP wrapper to use different mixed precision (#90523 ) The main change is to move `args` and `kwargs` dtype conversion from `_root_pre_forward` to `_pre_forward`, so that every FSDP has a chance to apply its own precision. Pull Request resolved: https://github.com/pytorch/pytorch/pull/90523 Approved by: https://github.com/awgu, https://github.com/rohan-varma	2022-12-09 20:06:05 +00:00
Xilun Wu	3759777edc	[threaded PG] fix long hang issue in testing (#90515 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/90515 Approved by: https://github.com/wanchaol	2022-12-09 05:24:08 +00:00
Iris	0cc0e5ef65	[PT-D][Checkpoint]Add MultiThreaded FileSystemWriter for distributed checkpointing and Update tests (#87987 ) This PR includes: Changes from @kumpera (https://github.com/pytorch/pytorch/pull/86327): adding MultiThreaded FileSystemWriter for distributed checkpointing, which adds two knobs to FileSystemWriter: thread_count and per_thread_copy_ahead. This yields up to a 50% performance improvement on 32-GPU workloads on AWS. Add parametrized tests to /test/distributed/_shard/checkpoint/test_file_system_checkpoint.py and /test/distributed/_shard/checkpoint/test_file_system_checkpoint_cpu.py Modify @with_comms in ShardedTensorTestBase to take in *args and **kwargs. Tests: ``` python3 test/distributed/checkpoint/test_file_system_checkpoint_cpu.py ``` test/distributed/checkpoint/test_file_system_checkpoint.py (GPU tests) runs fine locally but would time out on CI. We will use thread-based PG and update this test in a following PR. [T134844615] ## Add docstring and update comments in the following PRs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/87987 Approved by: https://github.com/fduwjj	2022-11-30 08:19:41 +00:00
Kazuaki Ishizaki	1cd6ebe095	Fix typos in messages under torch (#89049 ) This PR fixes typos of messages in `.py` files under torch directory. Only in `torch/onnx/symbolic_opset16.py`, fix a typo in comment to make the operator name correct. Pull Request resolved: https://github.com/pytorch/pytorch/pull/89049 Approved by: https://github.com/lezcano	2022-11-17 04:18:14 +00:00
Fuzzkatt	8ba62bdff5	add test_c10d_spawn_ucc.py (#86508 ) Initial PR to create UCC equivalent of https://github.com/pytorch/pytorch/blob/master/test/distributed/test_c10d_spawn_gloo.py and https://github.com/pytorch/pytorch/blob/master/test/distributed/test_c10d_spawn_nccl.py. Currently only added common ops. Pull Request resolved: https://github.com/pytorch/pytorch/pull/86508 Approved by: https://github.com/kwen2501	2022-11-16 22:50:11 +00:00
Charlie Yan	ee05f47bdd	Rebase and re-land thread PG (#88795 ) The previous PR (https://github.com/pytorch/pytorch/pull/88627) has been reverted due to a failed check. After rebasing and rerun, all checks passed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88795 Approved by: https://github.com/huydhn, https://github.com/wanchaol	2022-11-15 21:58:58 +00:00
PyTorch MergeBot	c7fc710459	Revert "[3/n] Thread PG: add threaded PG implementation (#88627 )" This reverts commit 6dd081846e3ae6192b375d658d4b4f3d6bd9df6e. Reverted https://github.com/pytorch/pytorch/pull/88627 on behalf of https://github.com/huydhn due to This breaks one macos m1 test `6dd081846e` in trunk. PR also fails with the same issue so I think trymerge code has a bug here letting this one merged	2022-11-09 22:38:41 +00:00
Charlie Yan	6dd081846e	[3/n] Thread PG: add threaded PG implementation (#88627 ) Summary: After the previous 2 diffs, finally we can add the threaded ProcessGroup implementation. Test Plan: TBD Reviewed By: XilunWu Differential Revision: D40992593 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88627 Approved by: https://github.com/XilunWu, https://github.com/H-Huang	2022-11-09 20:51:11 +00:00
Will Constable	70b00b1383	Add hf_bert + DDP multigpu test (#88435 ) Spot-checks an e2e model working with ddp. Pull Request resolved: https://github.com/pytorch/pytorch/pull/88435 Approved by: https://github.com/davidberard98	2022-11-04 03:17:48 +00:00
Fuzzkatt	d13f1e6ab4	Add sequence number support for UCC (#85047 ) Add sequence number support for UCC, mostly following the format of ProcessGroupNCCL. Pass new test: `test_all_gather_object_subgroup` Add skips for gather tests: `test_gather_object` and `test_gather_object_subgroup` cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu Pull Request resolved: https://github.com/pytorch/pytorch/pull/85047 Approved by: https://github.com/kwen2501	2022-10-31 03:56:55 +00:00
PyTorch MergeBot	f451e824f3	Revert " C10D extension to enable per-thread PG (#86348 )" This reverts commit 97abc21f2bda38e73de2a86da7f43c8126930681. Reverted https://github.com/pytorch/pytorch/pull/86348 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but it breaks macos tests `97abc21f2b`	2022-10-14 01:26:46 +00:00
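
Note on commit 04bf7713e8 (reusing ProcessGroup): the message contrasts test-scope fixtures (`setUp`/`tearDown`, run once per test) with class-scope fixtures (`setUpClass`/`tearDownClass`, run once per class). Below is a minimal sketch of that difference using only the standard `unittest` module. `FakeProcessGroup`, `PerTestFixture`, `PerClassFixture`, and `count_creations` are hypothetical names made up for illustration; they stand in for an expensive resource and are not the PyTorch test classes themselves.

```python
import unittest


class FakeProcessGroup:
    """Hypothetical stand-in for an expensive-to-create resource."""

    created = 0  # counts how many instances were constructed

    def __init__(self):
        FakeProcessGroup.created += 1

    def allreduce(self, values):
        # Toy "collective": just sum the inputs.
        return sum(values)


class PerTestFixture(unittest.TestCase):
    # Test-scope fixture: setUp runs before every test method.
    def setUp(self):
        self.pg = FakeProcessGroup()

    def test_a(self):
        self.assertEqual(self.pg.allreduce([1, 2]), 3)

    def test_b(self):
        self.assertEqual(self.pg.allreduce([4, 5]), 9)


class PerClassFixture(unittest.TestCase):
    # Class-scope fixture: setUpClass runs once for the whole class.
    @classmethod
    def setUpClass(cls):
        cls.pg = FakeProcessGroup()

    @classmethod
    def tearDownClass(cls):
        cls.pg = None

    def test_a(self):
        self.assertEqual(self.pg.allreduce([1, 2]), 3)

    def test_b(self):
        self.assertEqual(self.pg.allreduce([4, 5]), 9)


def count_creations(case):
    """Run one TestCase class and report how many resources it created."""
    FakeProcessGroup.created = 0
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(case)
    result = unittest.TestResult()
    suite.run(result)
    assert result.wasSuccessful()
    return FakeProcessGroup.created
```

With two tests per class, the per-test variant constructs the resource twice and the per-class variant once; the saving grows with the number of tests and the cost of construction.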

132 Commits