pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-20 21:14:14 +08:00

Author	SHA1	Message	Date
Maggie Moss	f02e3947f6	Expand type checking to mypy strict files (#165697 ) Expands Pyrefly type checking to check the files outlined in the mypy-strict.ini configuration file: Pull Request resolved: https://github.com/pytorch/pytorch/pull/165697 Approved by: https://github.com/ezyang	2025-10-18 04:34:45 +00:00
Yuanyuan Chen	b2953f5643	[9/N] Apply ruff UP035 rule (#165515 ) This is follow-up of #165214 to continue applying ruff UP035 rule to the code base. Pull Request resolved: https://github.com/pytorch/pytorch/pull/165515 Approved by: https://github.com/Lucaskabela	2025-10-17 00:09:51 +00:00
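For context, ruff's UP035 rule flags imports taken from deprecated locations, such as abstract container types imported from `typing` instead of `collections.abc`. A minimal sketch of that kind of rewrite, where `total_length` is a hypothetical function and not code from the PR:

```python
# Before (flagged by UP035): from typing import Iterable, Sequence
from collections.abc import Iterable, Sequence


def total_length(chunks: Iterable[Sequence[int]]) -> int:
    """Sum the lengths of all chunks; purely illustrative."""
    return sum(len(chunk) for chunk in chunks)
```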
Yuanyuan Chen	a029675f6f	More ruff SIM fixes (#164695 ) This PR applies ruff `SIM` rules to more files. Most changes are about simplifying `dict.get` because `None` is already the default value. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164695 Approved by: https://github.com/ezyang	2025-10-09 03:24:50 +00:00
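The `dict.get` simplification mentioned in the commit above can be illustrated as follows. The `config` dictionary and its keys are made up for the example:

```python
config = {"backend": "nccl"}

# Before: the explicit None default is redundant.
timeout_before = config.get("timeout", None)

# After: dict.get already returns None when the key is missing.
timeout_after = config.get("timeout")
```

Both calls behave identically, so the shorter form is a safe mechanical rewrite.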
fduwjj	86c789849e	[fr] Re-order mismatch check in fr analysis script (#164606 ) In reality we found the current mismatch order does not match the actual error distribution, so we reorder it a bit as following: 1. We do collective type check first 2. Then size check (excluding all2all) 3. dtype check 4. state check Pull Request resolved: https://github.com/pytorch/pytorch/pull/164606 Approved by: https://github.com/VieEeEw	2025-10-04 01:16:15 +00:00
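The check order listed in the commit above can be sketched roughly as below. `Op` and `first_mismatch` are hypothetical names, and the fields are a simplified assumption about what a recorded entry carries, not the script's actual data model:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Op:
    """Hypothetical, simplified view of one rank's recorded collective."""

    type: str
    input_sizes: list
    dtype: str
    state: str


def first_mismatch(a: Op, b: Op) -> Optional[str]:
    """Return the name of the first failing check, or None if all pass."""
    # 1. Collective type is compared first.
    if a.type != b.type:
        return "collective type"
    # 2. Sizes next; the commit says all2all is excluded from this check.
    if a.type != "all_to_all" and a.input_sizes != b.input_sizes:
        return "size"
    # 3. Then dtype.
    if a.dtype != b.dtype:
        return "dtype"
    # 4. Finally the collective state.
    if a.state != b.state:
        return "state"
    return None
```

Because the function returns at the first failing check, two entries that differ in several ways are reported under the earliest category in the list.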
fduwjj	c8e75c48b9	[fr] Skip the dtype check for some one to all or all to one collective (#163839 ) As the title says, in practice we found that the dtype of gather sometimes does not match across ranks for the output, which is undefined behavior. The same holds for broadcast and scatter. These collectives all complete, so we should not treat them as errors and can skip the check. Pull Request resolved: https://github.com/pytorch/pytorch/pull/163839 Approved by: https://github.com/VieEeEw	2025-09-25 16:02:06 +00:00
Phillip Liu	2c45628813	[Flight Recorder][WP] Added mismatch tail as an arg (#162991 ) Summary: Mismatch tail was a fixed variable, and in cases with more than 10 mismatches FR gave up producing results (e.g. https://fburl.com/ai_infra/7gjl5ucb). This diff adds the mismatch tail to the parsed args to make it configurable. Also, although the variable was named `mismatch_tail` (last 10), it was used as `mismatch_head` (the first 10), so it is renamed to `num_mismatch_to_print`. Test Plan: `buck2 run @//mode/opt //caffe2/fb/flight_recorder:fr_trace -- --mast_job_id aps-ctx_fm_pipeline_change-1c8ea38a94 --mast_job_version 0 --mast_job_attempt 2 --bucket tlcm_log_blob --world_size 128 --dump_file_name_offset 0 --allow-incomplete-ranks --num_mismatch_to_print 20 1>out 2>err` Confirm no error and output 20 mismatches. Rollback Plan: Differential Revision: D82335995 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162991 Approved by: https://github.com/fduwjj	2025-09-16 04:46:05 +00:00
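A minimal sketch of what a configurable print limit could look like. The flag name comes from the commit message; the default of 10 and the helper `mismatches_to_print` are assumptions made for illustration:

```python
import argparse

parser = argparse.ArgumentParser()
# Flag name from the commit message; the default value is an assumption.
parser.add_argument("--num_mismatch_to_print", type=int, default=10)


def mismatches_to_print(mismatches, args):
    """Keep only the first N mismatches (the head, not the tail)."""
    return mismatches[: args.num_mismatch_to_print]


args = parser.parse_args(["--num_mismatch_to_print", "20"])
```

Slicing from the front matches the commit's observation that the value was used as a head count rather than a tail count.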
frost-intel	9b4adc4db7	[fr] [xpu] Add FlightRecorder support for ProcessGroupXCCL (#158568 ) Adds support for FlightRecorder in ProcessGroupXCCL. See https://github.com/intel/torch-xpu-ops/pull/1867 for XCCL implementation and more details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158568 Approved by: https://github.com/guangyey, https://github.com/fduwjj	2025-08-22 09:03:35 +00:00
Howard Huang	65053c03a3	[FR] Don't check incomplete ranks for printing (#160195 ) When just printing the ranks (`-j` option) we should skip the check for "incomplete ranks" since that doesn't affect the print Pull Request resolved: https://github.com/pytorch/pytorch/pull/160195 Approved by: https://github.com/fduwjj ghstack dependencies: #160097	2025-08-14 18:19:45 +00:00
Howard Huang	96f9fbe21a	Fix flight recorder for P2P ops (#160097 ) Fixes errors in debugging a trace as mentioned in https://docs.google.com/document/d/1EKVJYmW2hj_VsvDvnSggXhZzJyvMu9dA0iDJWOZAtjY/edit?tab=t.0 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160097 Approved by: https://github.com/fduwjj	2025-08-14 18:19:45 +00:00
fduwjj	1c2cba17ea	[FR] Add stack_id and an optional print of stack_id to stack_trace mapping (#160119 ) To better help users debug with FR, we want to add stack_id and print a map between stack_id and stack_trace (optional). Pull Request resolved: https://github.com/pytorch/pytorch/pull/160119 Approved by: https://github.com/H-Huang, https://github.com/wconstab	2025-08-11 07:27:10 +00:00
Tianhao Huang	14c7358c64	Enable fr_trace to read local traces from multiple hosts. (#159490 ) Summary: For training jobs particularly from GenAI, NCCL trace dumps are generated in the format of `<hostname>.pci3_rank_<rank>`. For multi-node training jobs, the hostname varies across traces. The current prefix matching logic can't handle this case. Test Plan: Create a local folder `dumps` and several empty files: `host0.pci3_rank_0`, `host0.pci3_rank_1`, `host1.pci3_rank_0`, `host1.pci3_rank_1` inside it. Then run ``` buck2 run fbcode//caffe2/fb/flight_recorder:fr_trace -- trace_dir dumps ``` Before this diff, fr_trace cannot locate any trace files, giving the following assertion error: ``` AssertionError: no files loaded from /home/tianhaoh/dumps with prefix pci3_rank_ ``` After this diff, fr_trace is able to locate the trace files, resulting in the exceptions like ``` dump = pickle.load(infile) ^^^^^^^^^^^^^^^^^^^ EOFError: Ran out of input ``` (since the trace files are fake and empty). Rollback Plan: Differential Revision: D79224727 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159490 Approved by: https://github.com/fduwjj	2025-08-06 03:15:34 +00:00
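One way to locate trace files whose names carry a varying hostname before the prefix is to match on the suffix pattern rather than on a leading prefix. This is only a sketch of the idea: `RANK_FILE` and `rank_of` are hypothetical names, and the actual matching logic in fr_trace may differ.

```python
import re

PREFIX = "pci3_rank_"  # prefix taken from the commit message

# Matches "<hostname>.pci3_rank_<rank>" for any hostname.
RANK_FILE = re.compile(rf"^(?P<host>.+)\.{re.escape(PREFIX)}(?P<rank>\d+)$")


def rank_of(filename):
    """Return the rank encoded in a trace filename, or None if it does not match."""
    m = RANK_FILE.match(filename)
    return int(m.group("rank")) if m else None
```

With the file names from the test plan, `rank_of("host0.pci3_rank_1")` and `rank_of("host1.pci3_rank_1")` both resolve to rank 1 even though the hostnames differ.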
Junjie Wang (PyTorch)	3106a33e41	[fr] Fix one error in analysis script when subPG world size is smaller than global size (#156156 ) Summary: We ran into an interesting case where we saw many mismatches, yet many of them turned out to be full matches. The reason is that we compared the dump ranks (0 to 79) against the local PG ranks (0 to 7), which leads to false-positive mismatches. Checking whether the dump ranks contain all expected ranks is sufficient. Test Plan: Test with the failed case with the script and we now see the correct behavior + new unit test case. Rollback Plan: Differential Revision: D76775373 Pull Request resolved: https://github.com/pytorch/pytorch/pull/156156 Approved by: https://github.com/VieEeEw	2025-06-17 21:17:58 +00:00
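The difference between the two comparisons can be shown with plain sets. `has_all_expected_ranks` is a hypothetical helper, not the function used in the script:

```python
def has_all_expected_ranks(dump_ranks, expected_ranks):
    """True when every expected rank appears among the dumped ranks."""
    return set(expected_ranks) <= set(dump_ranks)


dump_ranks = range(80)     # ranks present in the dump, 0 to 79
local_pg_ranks = range(8)  # ranks of the sub process group, 0 to 7

# Equality reports a mismatch even though nothing is missing.
equal = set(dump_ranks) == set(local_pg_ranks)
# The containment check accepts the same input.
contained = has_all_expected_ranks(dump_ranks, local_pg_ranks)
```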
Xuehai Pan	a69785b3ec	[BE] fix typos in tools/ (#156082 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156082 Approved by: https://github.com/soulitzer ghstack dependencies: #156079	2025-06-17 19:25:50 +00:00
Junjie Wang (PyTorch)	95abc0f515	[c10d][fr] Fix another bug when we should continue when the op list is empty (#151798 ) Differential Revision: D73375318 We shouldn't check the op list when it is empty. And later, when it is empty we pops it out from the queue we will check for collective matching. Added a unit test for this case and also covered the case fixed https://github.com/pytorch/pytorch/pull/151683 in the unit test as well. Pull Request resolved: https://github.com/pytorch/pytorch/pull/151798 Approved by: https://github.com/d4l3k, https://github.com/wconstab, https://github.com/fegin	2025-04-22 04:43:31 +00:00
Junjie Wang (PyTorch)	6e7b6e8d57	[c10d][fr] Fix a bug when first rank is not zero in the script (#151683 ) Summary: Further testing the script, we found that we shouldn't always assume rank 0 is the first rank, so we need to check all entries and see if it P2P op for this coalesced group. Test Plan: Directly test with corner case. Differential Revision: D73266257 Pull Request resolved: https://github.com/pytorch/pytorch/pull/151683 Approved by: https://github.com/fegin	2025-04-18 20:55:06 +00:00
fduwjj	6f9ffaa991	[c10d][fr] Fix script for uneven reduce scatter and update test cases (#151475 ) Somehow the type string for reduce scatter is "REDUCE_SCATTER" not "REDUCESCATTER". This PR fixed it and added more test cases. Differential Revision: [D73141245](https://our.internmc.facebook.com/intern/diff/D73141245) Pull Request resolved: https://github.com/pytorch/pytorch/pull/151475 Approved by: https://github.com/fegin	2025-04-17 02:11:08 +00:00
fduwjj	ae648f047c	[c10d][fr] Enable FR analysis script for rest of all coalesce op (#151247 ) We revisited how coalesced collective is working in https://github.com/pytorch/pytorch/pull/151243 and we now want to enable the script to work for slow path. The change is indeed bc-breaking but this is needed to make it work and the API is an internal use API. It is not user facing. For slow path the individual has input-sizes and output sizes recorded but no state. The final one has the state ready. We check the correctness of each individual collective one by one but we don't check the state match for these collectives, we can only check the state match for the last one which is the work item with coalesced label. Added more unit test for slow path. Pull Request resolved: https://github.com/pytorch/pytorch/pull/151247 Approved by: https://github.com/d4l3k, https://github.com/XilunWu	2025-04-15 20:53:03 +00:00
fduwjj	48b4bc1640	[c10d][fr] Enable FR analysis script for all fast-path coalesce op (#151243 ) This PR enables FR for all coalesced ops on the fast path. (Batch P2P is already enabled in the current script, so we mainly focus on non-P2P ops.) To explain what the fast path is, let's revisit how coalesced collectives work today: For non-P2P coalesced ops, there are several ways to call them (for legacy reasons): - Way one: Directly call a python api like all_reduce_coalesced in python; this will be deprecated soon. - Way two: Directly call an api inside PGNCCL like allreduce_coalesced. Way one eventually calls into this. This is not deprecated and will not be deprecated, IIUC. - Way three: Using _coalescing_manager in python, like: ``` with _coalescing_manager(): for i in range(num_colls): dist.all_reduce(tensors[i]) ``` This way has two paths: - Fast path: when users call all-reduce, all-gather-into-tensor or reduce-scatter, we only launch one big collective by calling the api from way one. - Slow path: we call startCoalescing() at the beginning, then a bunch of collectives (each one generates an FR entry), and then endCoalescing(). Inside startCoalescing(), groupStart() is called, and inside endCoalescing(), groupEnd() is called. So although this is going to be one collective, we call into PGNCCL for each coalesced collective in the slow path case. - Uneven all-gather (allgather_v) and reduce-scatter follow the pattern mentioned in the slow path. They directly call the cpp api inside PGNCCL. This PR addresses the fast path because it is the easy case: we store the collectives info on the python side and only call into PGNCCL once, so there is only one work and one FR entry. We can just treat them as regular coalesced collectives. We add some e2e unit tests for the build_db function so that the change to FR is more thoroughly tested. Pull Request resolved: https://github.com/pytorch/pytorch/pull/151243 Approved by: https://github.com/d4l3k, https://github.com/wz337	2025-04-15 04:08:28 +00:00
fduwjj	48132de4af	[c10d][fr] Fix the false positive in the dtype check in fr analysis script (#151063 ) When checking dtype in the fr analysis script, we should only check it when the input or output numel is larger than zero. For gather or scatter, the output/input size is an empty list on non-src or non-dst ranks, so we should just skip the check there. Differential Revision: [D72826823](https://our.internmc.facebook.com/intern/diff/D72826823) Pull Request resolved: https://github.com/pytorch/pytorch/pull/151063 Approved by: https://github.com/d4l3k, https://github.com/kwen2501	2025-04-11 02:11:58 +00:00
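A rough sketch of guarding a dtype comparison on element count. `numel` and `should_check_dtype` are hypothetical helpers, and applying the guard per side (input or output) is an assumption about how the fix is structured:

```python
import math


def numel(sizes):
    """Total element count across a list of tensor shapes; an empty list gives 0."""
    return sum(math.prod(shape) for shape in sizes)


def should_check_dtype(sizes):
    """Only compare dtypes for a side that actually carries elements."""
    return numel(sizes) > 0


# On a non-dst rank of a gather, the recorded output sizes are an empty list.
gather_output_on_non_dst = []
gather_input = [[4, 5]]
```

Here `should_check_dtype(gather_output_on_non_dst)` is false, so the output dtype would be skipped on that rank while the input dtype is still compared.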
fduwjj	8aaf296efc	[c10d][fr] Refactor analysis script for modularization and reusing for coalesce collectives (#150881 ) To make the FR analysis code more reusable and modular, we split the core error analysis logic into separate functions. This PR mostly shuffles the code around a bit. Differential Revision: [D72690120](https://our.internmc.facebook.com/intern/diff/D72690120) Pull Request resolved: https://github.com/pytorch/pytorch/pull/150881 Approved by: https://github.com/wz337	2025-04-09 16:10:19 +00:00
Phillip Liu	31634b8c6a	[fr] Added protection against missing stack frames in fr cont. (#150133 ) Summary: Previously we had D70358287, which didn't fully resolve the issue. Test Plan: # FR `buck2 run @//mode/opt //caffe2/fb/flight_recorder:fr_trace -- --mast_job_id f710320638-TrainingApplication --mast_job_version 0 --mast_job_attempt 0 --bucket tlcm_log_blob --world_size 128 --dump_file_name_offset 0 --allow-incomplete-ranks` Confirm no error # FR analyzer `buck2 run @//mode/opt //investigations/dr_patternson/analyzers/ai_observability:ai_observability-all-analyzers-cli -- flight_recorder_analyzer --mast_job_name f710320638-TrainingApplication --mast_job_version 0 --mast_job_attempt 0` Confirm no error Differential Revision: D71998980 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150133 Approved by: https://github.com/fduwjj	2025-04-01 03:07:59 +00:00
Phillip Liu	ce2f680e00	[fr] Added protection against missing stack frames in fr (#148203 ) Summary: We have seen failures from this unprotected access for quite a while. https://fburl.com/scuba/ai_rca_debug_tracing/qtnb63qf Test Plan: Reviewed By: fduwjj Differential Revision: D70358287 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148203 Approved by: https://github.com/fduwjj	2025-03-02 01:03:49 +00:00
Xuehai Pan	c73a92fbf5	[BE][CI] bump `ruff` to 0.9.2: multiline `assert` statements (#144546 ) Reference: https://docs.astral.sh/ruff/formatter/black/#assert-statements > Unlike Black, Ruff prefers breaking the message over breaking the assertion, similar to how both Ruff and Black prefer breaking the assignment value over breaking the assignment target: > > ```python > # Input > assert ( > len(policy_types) >= priority + num_duplicates > ), f"This tests needs at least {priority+num_duplicates} many types." > > > # Black > assert ( > len(policy_types) >= priority + num_duplicates > ), f"This tests needs at least {priority+num_duplicates} many types." > > # Ruff > assert len(policy_types) >= priority + num_duplicates, ( > f"This tests needs at least {priority + num_duplicates} many types." > ) > ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/144546 Approved by: https://github.com/malfet	2025-02-27 20:46:16 +00:00
fduwjj	fb55bac3de	[fr][fix] Split MatchState and dynamic info for fr analysis downstream (#147439 ) The original MatchState type was declared as a python Enum. Although we did make it callable but we consume it right away. There are downstream cases when we need it to be a python class which is not supported in Python enum. So we did a small refactoring so that we keep both the enum state and dynamic info (culprit) for the fr analysis script. Differential Revision: [D69830994](https://our.internmc.facebook.com/intern/diff/D69830994) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147439 Approved by: https://github.com/fegin	2025-02-19 22:09:16 +00:00
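The split described in the commit above, a plain enum for the state plus a separate object for per-instance details, might look roughly like this. `MatchState` and `FULLY_MATCHED` appear in this log; `SIZE_MISMATCH`, `MatchInfo`, and the `culprit` field layout are assumptions made for illustration:

```python
from dataclasses import dataclass
from enum import Enum, auto


class MatchState(Enum):
    """Static set of outcomes; enum members cannot carry per-call data."""

    FULLY_MATCHED = auto()
    SIZE_MISMATCH = auto()  # hypothetical member name


@dataclass
class MatchInfo:
    """Hypothetical wrapper pairing the enum state with dynamic info."""

    state: MatchState
    culprit: str = ""

    def __str__(self) -> str:
        if self.culprit:
            return f"{self.state.name} (culprit: {self.culprit})"
        return self.state.name
```

Keeping the culprit outside the enum means each comparison can report its own details without mutating shared enum members.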
Aaron Orenstein	07669ed960	PEP585 update - benchmarks tools torchgen (#145101 ) This is one of a series of PRs to update us to PEP585 (changing Dict -> dict, List -> list, etc). Most of the PRs were completely automated with RUFF as follows: Since RUFF UP006 is considered an "unsafe" fix first we need to enable unsafe fixes: ``` --- a/tools/linter/adapters/ruff_linter.py +++ b/tools/linter/adapters/ruff_linter.py @@ -313,6 +313,7 @@ "ruff", "check", "--fix-only", + "--unsafe-fixes", "--exit-zero", *([f"--config={config}"] if config else []), "--stdin-filename", ``` Then we need to tell RUFF to allow UP006 (as a final PR once all of these have landed this will be made permanent): ``` --- a/pyproject.toml +++ b/pyproject.toml @@ -40,7 +40,7 @@ [tool.ruff] -target-version = "py38" +target-version = "py39" line-length = 88 src = ["caffe2", "torch", "torchgen", "functorch", "test"] @@ -87,7 +87,6 @@ "SIM116", # Disable Use a dictionary instead of consecutive `if` statements "SIM117", "SIM118", - "UP006", # keep-runtime-typing "UP007", # keep-runtime-typing ] select = [ ``` Finally running `lintrunner -a --take RUFF` will fix up the deprecated uses. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145101 Approved by: https://github.com/bobrenjc93	2025-01-18 05:05:07 +00:00
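As a small illustration of the PEP 585 change, built-in generics replace their `typing` aliases in annotations (Python 3.9+). `group_by_rank` is a hypothetical function, not code from the PR:

```python
# Before: from typing import Dict, List, Tuple
#         def group_by_rank(calls: List[Tuple[int, str]]) -> Dict[int, List[str]]: ...


def group_by_rank(calls: list[tuple[int, str]]) -> dict[int, list[str]]:
    """Group collective names by the rank that issued them."""
    grouped: dict[int, list[str]] = {}
    for rank, name in calls:
        grouped.setdefault(rank, []).append(name)
    return grouped
```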
fduwjj	e3c4d1b7d6	[c10d][fr] Fix the bug when we still mark mismatch when there are match case (#144916 ) When we introduced partial match, we accidentally started marking the full-match case as a mismatch. This is wrong, and this PR fixes it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144916 Approved by: https://github.com/c-p-i-o	2025-01-16 04:36:30 +00:00
Chirag Pandya	0bdc173ab6	[fr] recognize all_reduce_barrier as a valid op (#143354 ) Summary: D67068632 introduced a better profiling name for barrier operations to be able to distinguish various ops. Unfortunately, this broke Flight Recorder Analysis with the following error as reported by dmwu ``` fr_trace -m torchx-param_bench_16g_mi300x-all_to_all -a 0 --mast_job_version 98 -w 16 Traceback (most recent call last): File "/usr/local/fbcode/platform010/lib/python3.10/runpy.py", line 196, in _run_module_as_main return _run_code(code, main_globals, None, File "/usr/local/fbcode/platform010/lib/python3.10/runpy.py", line 86, in _run_code ``` Test Plan: Test manually. Differential Revision: D67305997 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143354 Approved by: https://github.com/wconstab	2024-12-17 21:09:18 +00:00
Uttam Thakore	314e08eb52	[fr_trace][bugfix] Log missing ranks when provided (#141924 ) Summary: For missing ranks issues, `build_collectives` doesn't log any errors (`5c2584a14c/tools/flight_recorder/components/builder.py (L293C23-L306C24)`), which means that when `EntryState.to_collective` is called [here](`5c2584a14c/tools/flight_recorder/components/builder.py (L400C21-L405C22)`), errors will be empty and `to_collective` will enter the first if statement. But that codepath doesn't log `missing_ranks`, meaning it will be absent from the `Collective` returned. This diff fixes that oversight. Test Plan: eyes Sandcastle run Differential Revision: D66679224 Pull Request resolved: https://github.com/pytorch/pytorch/pull/141924 Approved by: https://github.com/c-p-i-o	2024-12-03 17:54:43 +00:00
Aaron Gokaslan	08db735629	[BE]: Update mypy to 1.13.0 (#140808 ) Update mypy to 1.13.0 . Should hopefully reduce linting time. Has support for orjson cache serialization which should improve mypy cache perf if orjson is installed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140808 Approved by: https://github.com/ezyang, https://github.com/malfet	2024-12-03 02:50:10 +00:00
PyTorch MergeBot	daa77f3d9f	Revert "[BE]: Update mypy to 1.13.0 (#140808 )" This reverts commit 00134d68af2ce50560fa5a74473665ea229e6c9d. Reverted https://github.com/pytorch/pytorch/pull/140808 on behalf of https://github.com/huydhn due to This is failing a distributed test in trunk, target determination missed this test and did not run it on PR ([comment](https://github.com/pytorch/pytorch/pull/140808#issuecomment-2512788426))	2024-12-02 20:47:43 +00:00
Aaron Gokaslan	00134d68af	[BE]: Update mypy to 1.13.0 (#140808 ) Update mypy to 1.13.0 . Should hopefully reduce linting time. Has support for orjson cache serialization which should improve mypy cache perf if orjson is installed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140808 Approved by: https://github.com/ezyang, https://github.com/malfet	2024-12-02 18:47:54 +00:00
Junjie Wang (PyTorch)	53f8a5fde2	[FR] Include mismatch rank into mismatch_collectives and update log message (#141631 ) Summary: We want to return the mismatch ranks info in the `mismatch_collectives` field. Also update the logging message when no error is found and it's not partial analysis. Test Plan: CI Differential Revision: D66522602 Pull Request resolved: https://github.com/pytorch/pytorch/pull/141631 Approved by: https://github.com/c-p-i-o	2024-11-27 18:57:21 +00:00
Chirag Pandya	150ffb6e07	[flight recorder] Updated MatchState to have a member variable (#141297 ) Summary: Without this change calling `str(MatchState.SOMETHING)` will cause exception. Test Plan: Can we add unittest somewhere? Ensure `str(MatchState.FULLY_MATCHED)` and `str(MatchState.FULLY_MATCHED())` won't raise exception. Differential Revision: D66321609 Pull Request resolved: https://github.com/pytorch/pytorch/pull/141297 Approved by: https://github.com/fduwjj	2024-11-22 03:14:34 +00:00
Chirag Pandya	32094626f2	[fr] fix OSS broken flight recorder (#140973 ) Summary: OSS flight recorder does not work because we renamed `trace_dir` to `folder` in the internal version to reuse the loader. Fixes item #2 in reported issue: https://github.com/pytorch/pytorch/issues/140879 Test Plan: BEFORE: ``` ❯ python ./tools/flight_recorder/fr_trace.py ~/fr/140563/nccl_trace_logs --prefix nccl_trace_rank_container-node1_ tabulate is not installed. Proceeding without it. Traceback (most recent call last): File "/data/users/cpio/fbsource/fbcode/caffe2/./tools/flight_recorder/fr_trace.py", line 52, in <module> main() File "/data/users/cpio/fbsource/fbcode/caffe2/./tools/flight_recorder/fr_trace.py", line 44, in main details, version = read_dir(args) File "/home/cpio/local/pytorch/tools/flight_recorder/components/loader.py", line 89, in read_dir assert len(details) > 0, f"no files loaded from {args.folder} with prefix {prefix}" AttributeError: 'Namespace' object has no attribute 'folder' ``` AFTER: ``` python ./tools/flight_recorder/fr_trace.py ~/fr/140563/nccl_trace_logs --prefix nccl_trace_rank_container-node17_ tabulate is not installed. Proceeding without it. 
Traceback (most recent call last): File "/data/users/cpio/fbsource/fbcode/caffe2/./tools/flight_recorder/fr_trace.py", line 52, in <module> main() File "/data/users/cpio/fbsource/fbcode/caffe2/./tools/flight_recorder/fr_trace.py", line 45, in main db = build_db(details, args, version) File "/home/cpio/local/fbsource/fbcode/caffe2/tools/flight_recorder/components/builder.py", line 446, in build_db check_no_missing_dump_files(entries, memberships) File "/home/cpio/local/fbsource/fbcode/caffe2/tools/flight_recorder/components/utils.py", line 267, in check_no_missing_dump_files dumps_ranks == all_ranks AssertionError: Missing dump files from ranks {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119} ``` Differential Revision: D66117013 Pull Request resolved: https://github.com/pytorch/pytorch/pull/140973 Approved by: https://github.com/Skylion007, https://github.com/fduwjj	2024-11-20 02:58:11 +00:00
Chirag Pandya	8bc4033814	[fr][ez] better log messages + minor fixups (#140969 ) Summary: 1. Clearly specify error messages that we are refering to a collective_sequence_id and an internal_record id for entry. The entry id is semi-useless for the end consumer so at least let them know that this is an internal record id. 2. Add some missing fields in types.py. self.missing_ranks = set() self.input_numel = tuple() self.output_numel = tuple() self.errors = set() These were showing up as linter errors when I opened the file in vs-code Test Plan: ``` buck2 run //caffe2/fb/flight_recorder:fr_trace -- -m f665492593-nerf_training-96ab95e0 -w 8 --mast_job_version 0 -a 0 Buck UI: https://www.internalfb.com/buck2/2cac9273-1b7b-47bf-867f-82f9a4c1d581 Network: Up: 0B Down: 0B Not all ranks joining collective: sequence number: 31117 internal record id: 31116 group info: 0:default_pg collective: nccl:all_reduce missing ranks: {3, 4, 5, 6, 7} input sizes: [[1571911]] output sizes: [[1571911]] world size: 8 expected ranks: {0, 1, 2, 3, 4, 5, 6, 7} collective state: scheduled collective stack trace: all_reduce at /packages/fblearner.flow.canary/workflow#link-tree/torch/distributed/distributed_c10d.py:2707 wrapper at /packages/fblearner.flow.canary/workflow#link-tree/torch/distributed/c10d_logger.py:81 sync_buffers at /packages/fblearner.flow.canary/workflow#link-tree/xri_mapsr/neural_fields/models/gaussian_splatting.py:650 decorate_context at /packages/fblearner.flow.canary/workflow#link-tree/torch/utils/_contextlib.py:116 step at /packages/fblearner.flow.canary/workflow#link-tree/xri_mapsr/neural_fields/training/training_manager/splatting.py:356 main at /packages/fblearner.flow.canary/workflow#link-tree/xri_mapsr/neural_fields/nerf_training.py:260 main_impl at /packages/fblearner.flow.canary/workflow#link-tree/rl_aiep/mast/endpoint.py:57 main at /packages/fblearner.flow.canary/workflow#link-tree/rl_aiep/mast/endpoint.py:34 wrapper at 
/packages/fblearner.flow.canary/workflow#link-tree/torch/distributed/elastic/multiprocessing/errors/__init__.py:355 <module> at /packages/fblearner.flow.canary/workflow#link-tree/rl_aiep/mast/endpoint.py:118 _run_code at /packages/fblearner.flow.canary/workflow#link-tree/runtime/lib/python3.10/runpy.py:86 _run_module_as_main at /packages/fblearner.flow.canary/workflow#link-tree/runtime/lib/python3.10/runpy.py:196 run_as_main at /packages/fblearner.flow.canary/workflow#link-tree/__par__/bootstrap.py:69 run_as_main at /packages/fblearner.flow.canary/workflow#link-tree/__par__/meta_only/bootstrap.py:98 __invoke_main at /packages/fblearner.flow.canary/workflow#link-tree/__run_lpar_main__.py:28 <module> at /packages/fblearner.flow.canary/workflow#link-tree/__run_lpar_main__.py:31 ... Differential Revision: D66018461 Pull Request resolved: https://github.com/pytorch/pytorch/pull/140969 Approved by: https://github.com/Skylion007, https://github.com/fduwjj	2024-11-19 04:39:16 +00:00
Junjie Wang (PyTorch)	c61ccaf10e	[FR] Polish the log message for dtype mismatch and don't exit when too many mismatch (#140451 ) Summary: 1. We don't want to exit with exceptions when there are so many mismatches. We should just break and return. 2. Polish the message of dtype mismatch. This is because dtype of input/output is actually a list not a string. So we don't want to show a list of ['double'] in the output message. Test Plan: Testing on the case when we see too many collective dtype mismatch {F1958467224} Differential Revision: D65841830 Pull Request resolved: https://github.com/pytorch/pytorch/pull/140451 Approved by: https://github.com/c-p-i-o	2024-11-13 07:24:53 +00:00
Junjie Wang (PyTorch)	23db92bad2	[FR] refactor build collective and return more info to db (#140082 ) (#140303 ) Summary: This change is trying to return the result of analysis with more details. Internally the contract is listed in https://docs.google.com/document/d/19ON5jKlYirT76D4Q-OoGMgD-U2L_sCDnUd_RE1gfiLE/edit?tab=t.0. For OSS, this change is BC to the current behavior. Also create a new state object which handle logging and convert to object to Collective and NCCLCall. Test Plan: CI and more thorough testing is on the way. Reviewed By: VieEeEw Differential Revision: D65612448 Pull Request resolved: https://github.com/pytorch/pytorch/pull/140303 Approved by: https://github.com/c-p-i-o	2024-11-12 03:43:02 +00:00
fduwjj	ceb44b22dc	[FR] Enable best effort partial analysis and verbose mode for trace printing (#139853 ) Based on user feedback, we want to enable two things for the FR analysis script: 1. Print out more information when verbose is specified. 2. Perform best-effort analysis when not all ranks have an FR trace dumped. Differential Revision: [D65516081](https://our.internmc.facebook.com/intern/diff/D65516081/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139853 Approved by: https://github.com/c-p-i-o	2024-11-11 14:38:32 +00:00
Chirag Pandya	47446cb5f3	[fr][c10d] move logger out from utils.py (#139806 ) Summary: Move flight recorder logger class out from utils.py into its own file. This makes the program more modular. This is mostly a refactoring/non-functional change. Test Plan: Build fr_trace locally and ran it. ``` buck build //caffe2/fb/flight_recorder:fr_trace Buck UI: https://www.internalfb.com/buck2/875ca6a3-e86e-4263-95a0-579502494c5c Network: Up: 0B Down: 0B Jobs completed: 6818. Time elapsed: 0.2s. BUILD SUCCEEDED ``` Ran it as follows: ``` cd buck-out/v2/gen/fbcode/caffe2/fb/flight_recorder ./fr_trace.par -p trace_ /tmp Not all ranks joining collective 3 at entry 2 group info: 0:default_pg collective: nccl:all_reduce missing ranks: {1} input sizes: [[4, 5]] output sizes: [[4, 5]] expected ranks: 2 collective state: scheduled collective stack trace: <module> at /home/cpio/test/c.py:66 ``` Differential Revision: D65503768 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139806 Approved by: https://github.com/fduwjj	2024-11-07 01:44:12 +00:00
Chirag Pandya	d549ddfb14	[fr][rfc] use a logger to control output for flight recorder analyzer (#139656 ) Summary: Use a logger to control output to console. This is useful for hiding debug/detail messages from the console vs. showing everything together. Test Plan: Ran `torchfrtrace` with various switches. With the `-v` verbose switch: ``` torchfrtrace --prefix "trace_" /tmp/ -v loaded 2 files in 0.2567298412322998s built groups, memberships Not all ranks joining collective 3 at entry 2 group info: 0:default_pg collective: nccl:all_reduce missing ranks: {1} input sizes: [[4, 5]] output sizes: [[4, 5]] expected ranks: 2 collective state: scheduled collective stack trace: <module> at /home/cpio/test/c.py:66 appending a non-matching collective built collectives, nccl_calls Groups id desc size -------------------- ---------- ------ 09000494312501845833 default_pg 2 Memberships group_id global_rank -------------------- ------------- 09000494312501845833 0 09000494312501845833 1 Collectives id group_id ---- ---------- 0 0 1 0 NCCLCalls id collective_id group_id global_rank traceback_id collective_type sizes ---- --------------- ---------- ------------- -------------- ----------------- -------- 0 0 0 0 0 nccl:all_reduce [[3, 4]] 1 0 0 1 0 nccl:all_reduce [[3, 4]] 2 1 0 0 0 nccl:all_reduce [[3, 4]] 3 1 0 1 0 nccl:all_reduce [[3, 4]] 4 0 0 0 nccl:all_reduce [[4, 5]] ``` Without the verbose switch: ``` ❯ torchfrtrace --prefix "trace_" /tmp/ Not all ranks joining collective 3 at entry 2 group info: 0:default_pg collective: nccl:all_reduce missing ranks: {1} input sizes: [[4, 5]] output sizes: [[4, 5]] expected ranks: 2 collective state: scheduled collective stack trace: <module> at /home/cpio/test/c.py:66 ``` With the `-j` switch: ``` ❯ torchfrtrace --prefix "trace_" /tmp/ -j Rank 0 Rank 1 ------------------------------------------------- ------------------------------------------------- all_reduce(input_sizes=[[3, 4]], state=completed) all_reduce(input_sizes=[[3, 4]], state=completed) all_reduce(input_sizes=[[3, 4]], state=completed) all_reduce(input_sizes=[[3, 4]], state=completed) all_reduce(input_sizes=[[4, 5]], state=scheduled) ``` Differential Revision: D65438520 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139656 Approved by: https://github.com/fduwjj	2024-11-05 20:14:18 +00:00
Chirag Pandya	6727f343b5	[c10d][fr][easy] Move check_no_missing_dump_files (#139417 ) Summary: Move check_no_missing_dump_files to after the "just print" location. This allows us to print dump_files when there are actual missing files. Test Plan: ``` torchfrtrace -j ~/pyper-training-online-924394600 --selected-ranks 1 2 Inferred common prefix nccl_trace_rank_ loaded 95 files in 0.040270328521728516s built groups, memberships Rank 1 Rank 2 ------------------------------------------------------------------ ------------------------------------------------------------------ broadcast(input_sizes=[[2]], state=completed) broadcast(input_sizes=[[2]], state=completed) ``` Without this change, the command was erroring out. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139417 Approved by: https://github.com/Skylion007, https://github.com/fduwjj	2024-10-31 22:55:01 +00:00
Xuehai Pan	267f82b860	[BE] Format `.ci/` / `.github/` / `benchmarks/` / `functorch/` / `tools/` / `torchgen/` with `ruff format` (#132577 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132577 Approved by: https://github.com/malfet	2024-10-11 18:30:26 +00:00
fduwjj	c7714b8d8d	[FR] Fix duplicate output for the case when not all ranks join on collective (#137256 ) As title: when testing on an internal case, we found very similar output repeated for the error where certain ranks do not join a collective. This is because we didn't put all ranks into `candidate_ranks`, so their entries were not removed and got checked again. Ideally, for the given case, we should report an out-of-order case, because ranks 0 and 1 call all-to-all while all the remaining ranks call all-gather-base. But when we select entries to compare, we don't have a global view of the entries. In this specific case, ranks 0 and 1 have a collective of PG 7 on entry 1130 with seq ID = 1130, whereas the other ranks have a collective of PG 0 on entry 1130 with seq ID = 2. It's hard to use the entry index to do the match, because this assumption will break once we consider p2p, so for now we defer this to users or to later stages of debugging. To make the message clearer, I also include both the seq ID and the record_id (aka entry index) in the message. (That does not mean this is impossible to implement in the code; for example, we could subtract from every record_id the maximum p2p seq id before it. But users will easily see the wrong order, so we don't think that logic is necessary now.) P1626755348 Differential Revision: [D63815335](https://our.internmc.facebook.com/intern/diff/D63815335/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137256 Approved by: https://github.com/c-p-i-o	2024-10-03 18:06:45 +00:00
fduwjj	9372692c7b	[FR] Make OSS fr_trace function available for internal script and improve pg filtering (#136473 ) Differential Revision: [D63287384](https://our.internmc.facebook.com/intern/diff/D63287384/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136473 Approved by: https://github.com/c-p-i-o	2024-09-24 02:34:43 +00:00
fduwjj	da51fe1c42	[FR] Fix errors in all2all check, improve some log output (#136399 ) We found that we show the hashed pg name in our script output, which is not user-friendly. We also found a bug in our all2all check, and we changed a number of error messages to make them more readable. Differential Revision: [D63206469](https://our.internmc.facebook.com/intern/diff/D63206469) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136399 Approved by: https://github.com/c-p-i-o	2024-09-23 16:31:31 +00:00
Will Constable	dc0e818738	[FR] Automatically infer a common filename prefix (#135158 ) Save the annoyance of specifying this on the command line each time Pull Request resolved: https://github.com/pytorch/pytorch/pull/135158 Approved by: https://github.com/fduwjj, https://github.com/c-p-i-o ghstack dependencies: #135157	2024-09-06 21:44:27 +00:00
Will Constable	06e414d7fe	[FR] Make trace_dir a required argument (#135157 ) Ensures users get a clean error if they forget to specify the dir, and improves the help message. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135157 Approved by: https://github.com/c-p-i-o, https://github.com/fduwjj	2024-09-06 21:44:27 +00:00
fduwjj	e020a8755a	[Fix][FR][ez] Remove debugging logs (#135308 ) Removing the print added during debugging process. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135308 Approved by: https://github.com/wz337	2024-09-06 06:14:33 +00:00
fduwjj	4a661e089a	[FR] Add version based logic to FR script and make traces print can be filtered (#135154 ) This PR passes the version around so that we can have different behaviors for different versions of the FR dump. It also adds logic to filter the printed traces to certain PGs (by desc) and ranks, plus some minor refactors to make names more accurate and to get the util function working. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135154 Approved by: https://github.com/wconstab	2024-09-05 00:59:32 +00:00
fduwjj	1993a2aa9e	[FR] Make pg_name unique, show P2P collective status and fix bugs when running the script as command (#134780 ) Fixes a bunch of bugs in the script when running with the generated command and 3D parallelism. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134780 Approved by: https://github.com/c-p-i-o ghstack dependencies: #134528	2024-08-30 18:03:17 +00:00

60 Commits