60 Commits

Author SHA1 Message Date
f02e3947f6 Expand type checking to mypy strict files (#165697)
Expands Pyrefly type checking to cover the files outlined in the mypy-strict.ini configuration file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165697
Approved by: https://github.com/ezyang
2025-10-18 04:34:45 +00:00
b2953f5643 [9/N] Apply ruff UP035 rule (#165515)
This is a follow-up to #165214, continuing to apply the ruff UP035 rule to the code base.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165515
Approved by: https://github.com/Lucaskabela
2025-10-17 00:09:51 +00:00
a029675f6f More ruff SIM fixes (#164695)
This PR applies ruff `SIM` rules to more files. Most changes simplify `dict.get` calls that pass `None` explicitly, since `None` is already the default value.
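
For illustration, a minimal sketch of that simplification (the dictionary and key here are hypothetical):
```
config = {"device": "cuda"}

# Before: the default is spelled out even though dict.get already returns None.
dtype = config.get("dtype", None)

# After: equivalent and simpler; this is the form the SIM rule rewrites to.
dtype = config.get("dtype")

assert config.get("dtype", None) == config.get("dtype")  # both are None
```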

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164695
Approved by: https://github.com/ezyang
2025-10-09 03:24:50 +00:00
86c789849e [fr] Re-order mismatch check in fr analysis script (#164606)
In practice, we found that the current mismatch-check order does not match the actual error distribution, so we reorder it as follows (sketched below):
1. Collective type check first
2. Then size check (excluding all2all)
3. Then dtype check
4. Finally, state check
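
A minimal sketch of the reordered sequence, using a hypothetical entry type rather than the actual FR data structures:
```
from dataclasses import dataclass

@dataclass
class Entry:  # hypothetical stand-in for an FR trace entry
    collective_type: str
    sizes: list
    dtype: str
    state: str

def first_mismatch(a: Entry, b: Entry):
    """Run the checks in the reordered sequence; return the first failing check."""
    if a.collective_type != b.collective_type:
        return "collective type"
    if a.collective_type != "all_to_all" and a.sizes != b.sizes:  # all2all excluded
        return "size"
    if a.dtype != b.dtype:
        return "dtype"
    if a.state != b.state:
        return "state"
    return None

print(first_mismatch(Entry("all_reduce", [[4]], "float", "completed"),
                     Entry("all_reduce", [[4]], "float", "scheduled")))  # -> "state"
```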

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164606
Approved by: https://github.com/VieEeEw
2025-10-04 01:16:15 +00:00
c8e75c48b9 [fr] Skip the dtype check for some one to all or all to one collective (#163839)
As titled: in practice we found that the output dtype of gather sometimes does not match across all ranks, which is undefined behavior; the same holds for broadcast and scatter. Since these collectives all completed, we should not treat them as errors, and we can skip the check.
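
A minimal sketch of the skip logic, with hypothetical names; the real script operates on FR trace entries:
```
# Collectives where the output dtype may legitimately differ across ranks.
ONE_TO_ALL_OR_ALL_TO_ONE = {"gather", "scatter", "broadcast"}

def should_check_dtype(collective_type: str, state: str) -> bool:
    # A completed gather/scatter/broadcast with mismatched output dtype is
    # undefined behavior rather than an error, so skip the check for it.
    if collective_type in ONE_TO_ALL_OR_ALL_TO_ONE and state == "completed":
        return False
    return True

print(should_check_dtype("gather", "completed"))      # False -> skip
print(should_check_dtype("all_reduce", "completed"))  # True  -> check
```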

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163839
Approved by: https://github.com/VieEeEw
2025-09-25 16:02:06 +00:00
2c45628813 [Flight Recorder][WP] Added mismatch tail as an arg (#162991)
Summary:
The mismatch tail is used as a fixed variable, and when there are more than 10 mismatches FR gives up producing results (e.g. https://fburl.com/ai_infra/7gjl5ucb). This diff adds the mismatch tail to the parsed args to make it configurable.

Also, although the variable is named `mismatch_tail` (the last 10), it is actually used as a `mismatch_head` (the first 10). Renamed it to `num_mismatch_to_print`.
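
A hedged sketch of the new knob; the flag name comes from this diff, while the surrounding parser setup is illustrative:
```
import argparse

parser = argparse.ArgumentParser("fr_trace")
parser.add_argument(
    "--num_mismatch_to_print",
    type=int,
    default=10,  # matches the previously hard-coded limit
    help="number of mismatched collectives to report before giving up",
)

args = parser.parse_args(["--num_mismatch_to_print", "20"])
print(args.num_mismatch_to_print)  # 20
```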

Test Plan:
`buck2 run @//mode/opt //caffe2/fb/flight_recorder:fr_trace -- --mast_job_id aps-ctx_fm_pipeline_change-1c8ea38a94 --mast_job_version 0 --mast_job_attempt 2 --bucket tlcm_log_blob --world_size 128 --dump_file_name_offset 0 --allow-incomplete-ranks --num_mismatch_to_print 20 1>out 2>err`
Confirm no error and output 20 mismatches.

Differential Revision: D82335995

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162991
Approved by: https://github.com/fduwjj
2025-09-16 04:46:05 +00:00
9b4adc4db7 [fr] [xpu] Add FlightRecorder support for ProcessGroupXCCL (#158568)
Adds support for FlightRecorder in ProcessGroupXCCL.

See https://github.com/intel/torch-xpu-ops/pull/1867 for XCCL implementation and more details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158568
Approved by: https://github.com/guangyey, https://github.com/fduwjj
2025-08-22 09:03:35 +00:00
65053c03a3 [FR] Don't check incomplete ranks for printing (#160195)
When just printing the ranks (the `-j` option), we should skip the check for "incomplete ranks", since it doesn't affect the printed output.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160195
Approved by: https://github.com/fduwjj
ghstack dependencies: #160097
2025-08-14 18:19:45 +00:00
96f9fbe21a Fix flight recorder for P2P ops (#160097)
Fixes errors in debugging a trace as mentioned in https://docs.google.com/document/d/1EKVJYmW2hj_VsvDvnSggXhZzJyvMu9dA0iDJWOZAtjY/edit?tab=t.0

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160097
Approved by: https://github.com/fduwjj
2025-08-14 18:19:45 +00:00
1c2cba17ea [FR] Add stack_id and an optional print of stack_id to stack_trace mapping (#160119)
To better help users debug with FR, we add a stack_id and optionally print a mapping from stack_id to stack_trace.

Screenshot:

<img width="1029" height="529" alt="image" src="https://github.com/user-attachments/assets/8404a1d3-cc33-4f5f-971b-29609ec316c1" />

<img width="1620" height="358" alt="image" src="https://github.com/user-attachments/assets/3dd29c8c-ff68-41a2-acfd-e770036cfeb1" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160119
Approved by: https://github.com/H-Huang, https://github.com/wconstab
2025-08-11 07:27:10 +00:00
14c7358c64 Enable fr_trace to read local traces from multiple hosts. (#159490)
Summary: For training jobs, particularly from GenAI, NCCL trace dumps are generated in the format `<hostname>.pci3_rank_<rank>`. For multi-node training jobs, the hostname varies across traces, and the current prefix-matching logic can't handle this case.
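
A minimal sketch of matching such filenames regardless of the hostname prefix (not the actual loader code):
```
import re

# Dump names look like "<hostname>.pci3_rank_<rank>", with the hostname
# varying across nodes, so anchor the match on the trailing rank instead.
RANK_SUFFIX = re.compile(r"rank_(\d+)$")

files = ["host0.pci3_rank_0", "host0.pci3_rank_1",
         "host1.pci3_rank_0", "host1.pci3_rank_1"]
for name in files:
    m = RANK_SUFFIX.search(name)
    if m:
        print(f"{name} -> rank {int(m.group(1))}")
```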

Test Plan:
Create a local folder `dumps` and several empty files: `host0.pci3_rank_0`, `host0.pci3_rank_1`, `host1.pci3_rank_0`, `host1.pci3_rank_1` inside it. Then run
```
buck2 run fbcode//caffe2/fb/flight_recorder:fr_trace -- trace_dir dumps
```

Before this diff, fr_trace cannot locate any trace files, giving the following assertion error:
```
AssertionError: no files loaded from /home/tianhaoh/dumps with prefix pci3_rank_
```

After this diff, fr_trace is able to locate the trace files, resulting in exceptions like
```
    dump = pickle.load(infile)
           ^^^^^^^^^^^^^^^^^^^
EOFError: Ran out of input
```
(since the trace files are fake and empty).

Differential Revision: D79224727

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159490
Approved by: https://github.com/fduwjj
2025-08-06 03:15:34 +00:00
3106a33e41 [fr] Fix one error in analysis script when subPG world size is smaller than global size (#156156)
Summary: We ran into an interesting case where we saw many mismatches, yet a lot of them turned out to be full matches. The reason is that we compared the dump ranks (0 to 79) against the local PG ranks (0 to 7), which leads to false-positive mismatches. It is sufficient to just check whether the dump ranks contain all expected ranks.
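
A minimal sketch of the fixed check, with made-up rank sets:
```
dump_ranks = set(range(80))     # global ranks present in the dump (0..79)
expected_ranks = set(range(8))  # ranks expected for the sub-PG (0..7)

# Comparing dump_ranks against the local PG ranks element-wise produced
# false positives; a containment check is sufficient.
assert expected_ranks.issubset(dump_ranks), (
    f"missing ranks: {expected_ranks - dump_ranks}"
)
print("all expected ranks present")
```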

Test Plan:
Tested the script with the failing case; we now see the correct behavior. Also added a new unit test case.

Differential Revision: D76775373

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156156
Approved by: https://github.com/VieEeEw
2025-06-17 21:17:58 +00:00
a69785b3ec [BE] fix typos in tools/ (#156082)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156082
Approved by: https://github.com/soulitzer
ghstack dependencies: #156079
2025-06-17 19:25:50 +00:00
95abc0f515 [c10d][fr] Fix another bug when we should continue when the op list is empty (#151798)
Differential Revision: D73375318

We shouldn't check the op list when it is empty; when it is empty and popped from the queue, we then check for collective matching. Added a unit test for this case, which also covers the case fixed in https://github.com/pytorch/pytorch/pull/151683.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151798
Approved by: https://github.com/d4l3k, https://github.com/wconstab, https://github.com/fegin
2025-04-22 04:43:31 +00:00
6e7b6e8d57 [c10d][fr] Fix a bug when first rank is not zero in the script (#151683)
Summary: While further testing the script, we found that we shouldn't always assume rank 0 is the first rank, so we need to check all entries to see whether any is a P2P op for this coalesced group.
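
A minimal sketch of the idea, with a hypothetical entry schema; the actual script inspects FR trace entries:
```
# Instead of only inspecting rank 0's entry, scan every rank's entry to
# decide whether this coalesced group is P2P.
entries_by_rank = {
    3: {"profiling_name": "nccl:send"},
    5: {"profiling_name": "nccl:recv"},
}

is_p2p = any(
    entry["profiling_name"].startswith(("nccl:send", "nccl:recv"))
    for entry in entries_by_rank.values()
)
print(is_p2p)  # True
```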

Test Plan: Directly test with corner case.

Differential Revision: D73266257

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151683
Approved by: https://github.com/fegin
2025-04-18 20:55:06 +00:00
6f9ffaa991 [c10d][fr] Fix script for uneven reduce scatter and update test cases (#151475)
Somehow the type string for reduce scatter is "REDUCE_SCATTER", not "REDUCESCATTER". This PR fixes it and adds more test cases.

Differential Revision: [D73141245](https://our.internmc.facebook.com/intern/diff/D73141245)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151475
Approved by: https://github.com/fegin
2025-04-17 02:11:08 +00:00
ae648f047c [c10d][fr] Enable FR analysis script for rest of all coalesce op (#151247)
We revisited how coalesced collectives work in https://github.com/pytorch/pytorch/pull/151243, and we now want to enable the script for the slow path. The change is indeed BC-breaking, but it is needed to make this work, and the API is internal-use only, not user facing. For the slow path, each individual collective has its input sizes and output sizes recorded but no state; only the final one has the state ready. We check the correctness of each individual collective one by one, but we don't check state matching for these collectives; we can only check the state match for the last one, which is the work item with the coalesced label.

Added more unit tests for the slow path.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151247
Approved by: https://github.com/d4l3k, https://github.com/XilunWu
2025-04-15 20:53:03 +00:00
48b4bc1640 [c10d][fr] Enable FR analysis script for all fast-path coalesce op (#151243)
This PR enables FR analysis for all coalesced ops on the fast path (batched P2P is already handled by the current script, so we mainly focus on non-P2P ops). To explain what the fast path is, let's revisit how coalesced collectives work today:

For non-P2P coalesced ops, there are several ways to call them (for legacy reasons):

- Way one: directly call a Python API like `all_reduce_coalesced` in Python; this will be deprecated soon.
- Way two: directly call an API inside PGNCCL like `allreduce_coalesced`. Way one eventually calls into this. This is not deprecated and will not be deprecated, IIUC.
- Way three: use `_coalescing_manager` in Python, like:
```
with _coalescing_manager():
    for i in range(num_colls):
           dist.all_reduce(tensors[i])
```
This way has two paths:
   - Fast path: when users call all-reduce, all-gather-into-tensor, or reduce-scatter, we launch only one big collective by calling the coalesced API mentioned above.
   - Slow path: we call startCoalescing() at the beginning, then a bunch of collectives (each generating an FR entry), and then endCoalescing(). Inside startCoalescing(), groupStart() is called, and inside endCoalescing(), groupEnd() is called. So although this ends up as one collective, we call into PGNCCL for each coalesced collective in the slow-path case.
   - Uneven all-gather (allgather_v) and reduce-scatter follow the slow-path pattern; they directly call the C++ API inside PGNCCL.

This PR addresses the fast path because it is the easy case: we store the collective info on the Python side and call into PGNCCL only once, so there is only one work object and one FR entry. We can just treat these as regular coalesced collectives.

We add e2e unit tests for the `build_db` function so that the change to FR is tested more thoroughly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151243
Approved by: https://github.com/d4l3k, https://github.com/wz337
2025-04-15 04:08:28 +00:00
48132de4af [c10d][fr] Fix the false positive in the dtype check in fr analysis script (#151063)
When checking dtype in the FR analysis script, we should only check it when the input or output numel is larger than zero. For gather or scatter, the output/input size is an empty list on non-src or non-dst ranks, so we should just skip the check there.
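
A minimal sketch of the guard, assuming sizes are recorded as lists of dimension lists:
```
from math import prod

def dtype_check_applies(input_sizes, output_sizes) -> bool:
    # Non-src/non-dst ranks of gather/scatter record empty size lists, so
    # only compare dtypes when there is actual data on both sides.
    input_numel = sum(prod(s) for s in input_sizes)
    output_numel = sum(prod(s) for s in output_sizes)
    return input_numel > 0 and output_numel > 0

print(dtype_check_applies([[4, 5]], [[4, 5]]))  # True  -> check dtype
print(dtype_check_applies([], []))              # False -> skip the check
```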

Differential Revision: [D72826823](https://our.internmc.facebook.com/intern/diff/D72826823)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151063
Approved by: https://github.com/d4l3k, https://github.com/kwen2501
2025-04-11 02:11:58 +00:00
8aaf296efc [c10d][fr] Refactor analysis script for modularization and reusing for coalesce collectives (#150881)
To make the FR analysis code more reusable and modular, we split the core error-analysis logic into separate functions.

This PR mostly shuffles the code around a bit.

Differential Revision: [D72690120](https://our.internmc.facebook.com/intern/diff/D72690120)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150881
Approved by: https://github.com/wz337
2025-04-09 16:10:19 +00:00
31634b8c6a [fr] Added protection against missing stack frames in fr cont. (#150133)
Summary: Previously we had D70358287, which didn't fully resolve the issue.

Test Plan:
# FR
`buck2 run @//mode/opt //caffe2/fb/flight_recorder:fr_trace -- --mast_job_id f710320638-TrainingApplication --mast_job_version 0 --mast_job_attempt 0 --bucket tlcm_log_blob --world_size 128 --dump_file_name_offset 0 --allow-incomplete-ranks`
Confirm no error
# FR analyzer
`buck2 run @//mode/opt //investigations/dr_patternson/analyzers/ai_observability:ai_observability-all-analyzers-cli -- flight_recorder_analyzer --mast_job_name f710320638-TrainingApplication --mast_job_version 0 --mast_job_attempt 0`
Confirm no error

Differential Revision: D71998980

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150133
Approved by: https://github.com/fduwjj
2025-04-01 03:07:59 +00:00
ce2f680e00 [fr] Added protection against missing stack frames in fr (#148203)
Summary: We have had quite a few failures due to this unprotected access. https://fburl.com/scuba/ai_rca_debug_tracing/qtnb63qf

Reviewed By: fduwjj

Differential Revision: D70358287

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148203
Approved by: https://github.com/fduwjj
2025-03-02 01:03:49 +00:00
c73a92fbf5 [BE][CI] bump ruff to 0.9.2: multiline assert statements (#144546)
Reference: https://docs.astral.sh/ruff/formatter/black/#assert-statements

> Unlike Black, Ruff prefers breaking the message over breaking the assertion, similar to how both Ruff and Black prefer breaking the assignment value over breaking the assignment target:
>
> ```python
> # Input
> assert (
>     len(policy_types) >= priority + num_duplicates
> ), f"This tests needs at least {priority+num_duplicates} many types."
>
>
> # Black
> assert (
>     len(policy_types) >= priority + num_duplicates
> ), f"This tests needs at least {priority+num_duplicates} many types."
>
> # Ruff
> assert len(policy_types) >= priority + num_duplicates, (
>     f"This tests needs at least {priority + num_duplicates} many types."
> )
> ```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144546
Approved by: https://github.com/malfet
2025-02-27 20:46:16 +00:00
fb55bac3de [fr][fix] Split MatchState and dynamic info for fr analysis downstream (#147439)
The original MatchState type was declared as a Python Enum. Although we did make it callable, we consume the result right away. There are downstream cases where we need it to be a regular Python class, which a Python Enum does not support. So we did a small refactoring to keep both the enum state and the dynamic info (the culprit) for the FR analysis script.
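
A hedged sketch of the shape of the refactor, with illustrative names rather than the actual FR types:
```
from enum import Enum
from typing import Optional

class MatchStateType(Enum):  # the enum keeps the static state
    FULLY_MATCHED = 1
    SIZE_MISMATCH = 2

class MatchInfo:
    """Wraps the enum state and also carries dynamic info (the culprit),
    which a bare Enum member cannot hold."""

    def __init__(self, state: MatchStateType, culprit: Optional[str] = None):
        self.state = state
        self.culprit = culprit

    def __str__(self) -> str:
        culprit = f", culprit: {self.culprit}" if self.culprit else ""
        return f"{self.state.name}{culprit}"

print(MatchInfo(MatchStateType.SIZE_MISMATCH, "rank 3"))
```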

Differential Revision: [D69830994](https://our.internmc.facebook.com/intern/diff/D69830994)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147439
Approved by: https://github.com/fegin
2025-02-19 22:09:16 +00:00
07669ed960 PEP585 update - benchmarks tools torchgen (#145101)
This is one of a series of PRs to update us to PEP585 (changing Dict -> dict, List -> list, etc).  Most of the PRs were completely automated with RUFF as follows:

Since RUFF UP006 is considered an "unsafe" fix first we need to enable unsafe fixes:

```
--- a/tools/linter/adapters/ruff_linter.py
+++ b/tools/linter/adapters/ruff_linter.py
@@ -313,6 +313,7 @@
                     "ruff",
                     "check",
                     "--fix-only",
+                    "--unsafe-fixes",
                     "--exit-zero",
                     *([f"--config={config}"] if config else []),
                     "--stdin-filename",
```

Then we need to tell RUFF to allow UP006 (as a final PR once all of these have landed this will be made permanent):

```
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -40,7 +40,7 @@

 [tool.ruff]
-target-version = "py38"
+target-version = "py39"
 line-length = 88
 src = ["caffe2", "torch", "torchgen", "functorch", "test"]

@@ -87,7 +87,6 @@
     "SIM116", # Disable Use a dictionary instead of consecutive `if` statements
     "SIM117",
     "SIM118",
-    "UP006", # keep-runtime-typing
     "UP007", # keep-runtime-typing
 ]
 select = [
```

Finally running `lintrunner -a --take RUFF` will fix up the deprecated uses.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145101
Approved by: https://github.com/bobrenjc93
2025-01-18 05:05:07 +00:00
e3c4d1b7d6 [c10d][fr] Fix the bug when we still mark mismatch when there are match case (#144916)
When we introduced partial matching, we accidentally started marking fully matched cases as mismatches. This is wrong, and this PR fixes it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144916
Approved by: https://github.com/c-p-i-o
2025-01-16 04:36:30 +00:00
0bdc173ab6 [fr] recognize all_reduce_barrier as a valid op (#143354)
Summary:
D67068632 introduced a better profiling name for barrier operations to be able to distinguish various ops.

Unfortunately, this broke Flight Recorder Analysis with the following error as reported by dmwu
```
fr_trace -m torchx-param_bench_16g_mi300x-all_to_all -a 0 --mast_job_version 98 -w 16
Traceback (most recent call last):
  File "/usr/local/fbcode/platform010/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/fbcode/platform010/lib/python3.10/runpy.py", line 86, in _run_code
```

Test Plan: Test manually.

Differential Revision: D67305997

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143354
Approved by: https://github.com/wconstab
2024-12-17 21:09:18 +00:00
314e08eb52 [fr_trace][bugfix] Log missing ranks when provided (#141924)
Summary: For missing-ranks issues, `build_collectives` doesn't log any errors (5c2584a14c/tools/flight_recorder/components/builder.py (L293C23-L306C24)), which means that when `EntryState.to_collective` is called [here](5c2584a14c/tools/flight_recorder/components/builder.py (L400C21-L405C22)), `errors` will be empty and `to_collective` will enter the first if statement. But that code path doesn't log `missing_ranks`, meaning they will be absent from the returned `Collective`. This diff fixes that oversight.

Test Plan:
eyes

Sandcastle run

Differential Revision: D66679224

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141924
Approved by: https://github.com/c-p-i-o
2024-12-03 17:54:43 +00:00
08db735629 [BE]: Update mypy to 1.13.0 (#140808)
Update mypy to 1.13.0. This should hopefully reduce linting time. It has support for orjson cache serialization, which should improve mypy cache performance if orjson is installed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140808
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-12-03 02:50:10 +00:00
daa77f3d9f Revert "[BE]: Update mypy to 1.13.0 (#140808)"
This reverts commit 00134d68af2ce50560fa5a74473665ea229e6c9d.

Reverted https://github.com/pytorch/pytorch/pull/140808 on behalf of https://github.com/huydhn due to This is failing a distributed test in trunk, target determination missed this test and did not run it on PR ([comment](https://github.com/pytorch/pytorch/pull/140808#issuecomment-2512788426))
2024-12-02 20:47:43 +00:00
00134d68af [BE]: Update mypy to 1.13.0 (#140808)
Update mypy to 1.13.0. This should hopefully reduce linting time. It has support for orjson cache serialization, which should improve mypy cache performance if orjson is installed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140808
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-12-02 18:47:54 +00:00
53f8a5fde2 [FR] Include mismatch rank into mismatch_collectives and update log message (#141631)
Summary: We want to return the mismatched-ranks info in the `mismatch_collectives` field. Also updates the logging message for when no error is found and the analysis is not partial.

Test Plan: CI

Differential Revision: D66522602

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141631
Approved by: https://github.com/c-p-i-o
2024-11-27 18:57:21 +00:00
150ffb6e07 [flight recorder] Updated MatchState to have a member variable (#141297)
Summary: Without this change, calling `str(MatchState.SOMETHING)` raises an exception.
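
A minimal sketch of the behavior being tested: members that stringify safely whether or not they are called (illustrative, not the actual class):
```
class MatchState:
    def __init__(self, name: str, culprit: str = ""):
        self._name = name
        self.culprit = culprit

    def __call__(self, culprit: str = "") -> "MatchState":
        # Calling a member produces a copy carrying the dynamic culprit info.
        return MatchState(self._name, culprit)

    def __str__(self) -> str:
        return f"{self._name} {self.culprit}".strip()

MatchState.FULLY_MATCHED = MatchState("FULLY_MATCHED")

print(str(MatchState.FULLY_MATCHED))    # works on the bare member
print(str(MatchState.FULLY_MATCHED()))  # and on the called form
```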

Test Plan:
Can we add a unit test somewhere?
Ensure `str(MatchState.FULLY_MATCHED)` and `str(MatchState.FULLY_MATCHED())` won't raise an exception.

Differential Revision: D66321609

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141297
Approved by: https://github.com/fduwjj
2024-11-22 03:14:34 +00:00
32094626f2 [fr] fix OSS broken flight recorder (#140973)
Summary:
OSS flight recorder does not work because we renamed `trace_dir` to `folder` in the internal version to reuse the loader.

Fixes item #2 in the reported issue:
https://github.com/pytorch/pytorch/issues/140879

Test Plan:
BEFORE:
```
❯ python ./tools/flight_recorder/fr_trace.py ~/fr/140563/nccl_trace_logs --prefix nccl_trace_rank_container-node1_
tabulate is not installed. Proceeding without it.
Traceback (most recent call last):
  File "/data/users/cpio/fbsource/fbcode/caffe2/./tools/flight_recorder/fr_trace.py", line 52, in <module>
    main()
  File "/data/users/cpio/fbsource/fbcode/caffe2/./tools/flight_recorder/fr_trace.py", line 44, in main
    details, version = read_dir(args)
  File "/home/cpio/local/pytorch/tools/flight_recorder/components/loader.py", line 89, in read_dir
    assert len(details) > 0, f"no files loaded from {args.folder} with prefix {prefix}"
AttributeError: 'Namespace' object has no attribute 'folder'
```

AFTER:
```
python ./tools/flight_recorder/fr_trace.py ~/fr/140563/nccl_trace_logs --prefix nccl_trace_rank_container-node17_
tabulate is not installed. Proceeding without it.
Traceback (most recent call last):
  File "/data/users/cpio/fbsource/fbcode/caffe2/./tools/flight_recorder/fr_trace.py", line 52, in <module>
    main()
  File "/data/users/cpio/fbsource/fbcode/caffe2/./tools/flight_recorder/fr_trace.py", line 45, in main
    db = build_db(details, args, version)
  File "/home/cpio/local/fbsource/fbcode/caffe2/tools/flight_recorder/components/builder.py", line 446, in build_db
    check_no_missing_dump_files(entries, memberships)
  File "/home/cpio/local/fbsource/fbcode/caffe2/tools/flight_recorder/components/utils.py", line 267, in check_no_missing_dump_files
    dumps_ranks == all_ranks
AssertionError: Missing dump files from ranks {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119}
```

Differential Revision: D66117013

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140973
Approved by: https://github.com/Skylion007, https://github.com/fduwjj
2024-11-20 02:58:11 +00:00
8bc4033814 [fr][ez] better log messages + minor fixups (#140969)
Summary:
1. Clearly specify in error messages that we are referring to a collective_sequence_id and an internal_record id for an entry.
The entry id is semi-useless for the end consumer, so at least let them know that it is an internal record id.
2. Add some missing fields in types.py.
  self.missing_ranks = set()
  self.input_numel = tuple()
  self.output_numel = tuple()
  self.errors = set()

These were showing up as linter errors when I opened the file in VS Code.

Test Plan:
```
buck2 run //caffe2/fb/flight_recorder:fr_trace -- -m f665492593-nerf_training-96ab95e0 -w 8 --mast_job_version 0 -a 0
Buck UI: https://www.internalfb.com/buck2/2cac9273-1b7b-47bf-867f-82f9a4c1d581
Network: Up: 0B  Down: 0B
Not all ranks joining collective: sequence number: 31117
internal record id: 31116
group info: 0:default_pg
collective: nccl:all_reduce
missing ranks: {3, 4, 5, 6, 7}
input sizes: [[1571911]]
output sizes: [[1571911]]
world size: 8
expected ranks: {0, 1, 2, 3, 4, 5, 6, 7}
collective state: scheduled
collective stack trace:
 all_reduce at /packages/fblearner.flow.canary/workflow#link-tree/torch/distributed/distributed_c10d.py:2707
wrapper at /packages/fblearner.flow.canary/workflow#link-tree/torch/distributed/c10d_logger.py:81
sync_buffers at /packages/fblearner.flow.canary/workflow#link-tree/xri_mapsr/neural_fields/models/gaussian_splatting.py:650
decorate_context at /packages/fblearner.flow.canary/workflow#link-tree/torch/utils/_contextlib.py:116
step at /packages/fblearner.flow.canary/workflow#link-tree/xri_mapsr/neural_fields/training/training_manager/splatting.py:356
main at /packages/fblearner.flow.canary/workflow#link-tree/xri_mapsr/neural_fields/nerf_training.py:260
main_impl at /packages/fblearner.flow.canary/workflow#link-tree/rl_aiep/mast/endpoint.py:57
main at /packages/fblearner.flow.canary/workflow#link-tree/rl_aiep/mast/endpoint.py:34
wrapper at /packages/fblearner.flow.canary/workflow#link-tree/torch/distributed/elastic/multiprocessing/errors/__init__.py:355
<module> at /packages/fblearner.flow.canary/workflow#link-tree/rl_aiep/mast/endpoint.py:118
_run_code at /packages/fblearner.flow.canary/workflow#link-tree/runtime/lib/python3.10/runpy.py:86
_run_module_as_main at /packages/fblearner.flow.canary/workflow#link-tree/runtime/lib/python3.10/runpy.py:196
run_as_main at /packages/fblearner.flow.canary/workflow#link-tree/__par__/bootstrap.py:69
run_as_main at /packages/fblearner.flow.canary/workflow#link-tree/__par__/meta_only/bootstrap.py:98
__invoke_main at /packages/fblearner.flow.canary/workflow#link-tree/__run_lpar_main__.py:28
<module> at /packages/fblearner.flow.canary/workflow#link-tree/__run_lpar_main__.py:31

...

Differential Revision: D66018461

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140969
Approved by: https://github.com/Skylion007, https://github.com/fduwjj
2024-11-19 04:39:16 +00:00
c61ccaf10e [FR] Polish the log message for dtype mismatch and don't exit when too many mismatch (#140451)
Summary:
1. We don't want to exit with an exception when there are too many mismatches; we should just break and return.
2. Polish the dtype-mismatch message: the dtype of input/output is actually a list, not a string, so we don't want to show a list like ['double'] in the output message.
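
A minimal sketch of the message fix, with made-up values:
```
input_dtypes = ["double"]  # FR records dtypes as a list, not a string

# Before: embeds the list repr, e.g. "... ['double'] ..."
before = f"Found dtype mismatch, input dtype: {input_dtypes}"

# After: join the list so the message reads cleanly.
after = f"Found dtype mismatch, input dtype: {','.join(input_dtypes)}"

print(before)
print(after)
```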

Test Plan:
Testing on the case when we see too many collective dtype mismatch

 {F1958467224}

Differential Revision: D65841830

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140451
Approved by: https://github.com/c-p-i-o
2024-11-13 07:24:53 +00:00
23db92bad2 [FR] refactor build collective and return more info to db (#140082) (#140303)
Summary:

This change returns the result of the analysis with more details. Internally, the contract is listed in https://docs.google.com/document/d/19ON5jKlYirT76D4Q-OoGMgD-U2L_sCDnUd_RE1gfiLE/edit?tab=t.0. For OSS, this change is backward compatible with the current behavior.

Also creates a new state object which handles logging and converts entries into `Collective` and `NCCLCall` objects.

Test Plan: CI and more thorough testing is on the way.

Reviewed By: VieEeEw

Differential Revision: D65612448

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140303
Approved by: https://github.com/c-p-i-o
2024-11-12 03:43:02 +00:00
ceb44b22dc [FR] Enable best effort partial analysis and verbose mode for trace printing (#139853)
Based on user feedback, we want to enable two things for the FR analysis script:
1. Print out more information when verbose is specified.
2. Perform best-effort analysis when not all ranks have an FR trace dumped.

Differential Revision: [D65516081](https://our.internmc.facebook.com/intern/diff/D65516081/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139853
Approved by: https://github.com/c-p-i-o
2024-11-11 14:38:32 +00:00
47446cb5f3 [fr][c10d] move logger out from utils.py (#139806)
Summary:
Move the flight recorder logger class out of utils.py into its own file.
This makes the program more modular.
This is mostly a refactoring/non-functional change.

Test Plan:
Build fr_trace locally and ran it.
```
buck build //caffe2/fb/flight_recorder:fr_trace
Buck UI: https://www.internalfb.com/buck2/875ca6a3-e86e-4263-95a0-579502494c5c
Network: Up: 0B  Down: 0B
Jobs completed: 6818. Time elapsed: 0.2s.
BUILD SUCCEEDED
```
Ran it as follows:
```
cd buck-out/v2/gen/fbcode/caffe2/fb/flight_recorder

./fr_trace.par  -p trace_ /tmp
Not all ranks joining collective 3 at entry 2
group info: 0:default_pg
collective: nccl:all_reduce
missing ranks: {1}
input sizes: [[4, 5]]
output sizes: [[4, 5]]
expected ranks: 2
collective state: scheduled
collective stack trace:
 <module> at /home/cpio/test/c.py:66
```

Differential Revision: D65503768

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139806
Approved by: https://github.com/fduwjj
2024-11-07 01:44:12 +00:00
d549ddfb14 [fr][rfc] use a logger to control output for flight recorder analyzer (#139656)
Summary: Use a logger to control output to the console. This is useful for hiding debug/detail messages from the console vs. showing everything together.

Test Plan:
Ran `torchfrtrace` with various switches.

The `-v` verbose switch
```
torchfrtrace --prefix "trace_" /tmp/ -v
loaded 2 files in 0.2567298412322998s
built groups, memberships
Not all ranks joining collective 3 at entry 2
group info: 0:default_pg
collective: nccl:all_reduce
missing ranks: {1}
input sizes: [[4, 5]]
output sizes: [[4, 5]]
expected ranks: 2
collective state: scheduled
collective stack trace:
 <module> at /home/cpio/test/c.py:66
appending a non-matching collective
built collectives, nccl_calls
Groups
                  id  desc          size
--------------------  ----------  ------
09000494312501845833  default_pg       2
Memberships
            group_id    global_rank
--------------------  -------------
09000494312501845833              0
09000494312501845833              1
Collectives
  id    group_id
----  ----------
   0           0
   1           0
NCCLCalls
  id    collective_id    group_id    global_rank    traceback_id  collective_type    sizes
----  ---------------  ----------  -------------  --------------  -----------------  --------
   0                0           0              0               0  nccl:all_reduce    [[3, 4]]
   1                0           0              1               0  nccl:all_reduce    [[3, 4]]
   2                1           0              0               0  nccl:all_reduce    [[3, 4]]
   3                1           0              1               0  nccl:all_reduce    [[3, 4]]
   4                            0              0               0  nccl:all_reduce    [[4, 5]]
```

Without the verbose switch
```
❯ torchfrtrace --prefix "trace_" /tmp/
Not all ranks joining collective 3 at entry 2
group info: 0:default_pg
collective: nccl:all_reduce
missing ranks: {1}
input sizes: [[4, 5]]
output sizes: [[4, 5]]
expected ranks: 2
collective state: scheduled
collective stack trace:
 <module> at /home/cpio/test/c.py:66
```

With the `-j` switch:
```
❯ torchfrtrace --prefix "trace_" /tmp/ -j
Rank 0                                             Rank 1
-------------------------------------------------  -------------------------------------------------
all_reduce(input_sizes=[[3, 4]], state=completed)  all_reduce(input_sizes=[[3, 4]], state=completed)
all_reduce(input_sizes=[[3, 4]], state=completed)  all_reduce(input_sizes=[[3, 4]], state=completed)
all_reduce(input_sizes=[[4, 5]], state=scheduled)
```

Differential Revision: D65438520

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139656
Approved by: https://github.com/fduwjj
2024-11-05 20:14:18 +00:00
6727f343b5 [c10d][fr][easy] Move check_no_missing_dump_files (#139417)
Summary:
Move check_no_missing_dump_files to after the "just print" location.
This allows us to print the dumps even when some dump files are actually missing.

Test Plan:
```
torchfrtrace -j ~/pyper-training-online-924394600  --selected-ranks 1 2

Inferred common prefix nccl_trace_rank_
loaded 95 files in 0.040270328521728516s
built groups, memberships
Rank 1                                                              Rank 2
------------------------------------------------------------------  ------------------------------------------------------------------
broadcast(input_sizes=[[2]], state=completed)                       broadcast(input_sizes=[[2]], state=completed)
```
Without this change, the command was erroring out.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139417
Approved by: https://github.com/Skylion007, https://github.com/fduwjj
2024-10-31 22:55:01 +00:00
267f82b860 [BE] Format .ci/ / .github/ / benchmarks/ / functorch/ / tools/ / torchgen/ with ruff format (#132577)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132577
Approved by: https://github.com/malfet
2024-10-11 18:30:26 +00:00
c7714b8d8d [FR] Fix duplicate output for the case when not all ranks join on collective (#137256)
As titled: when testing on an internal case, we found that we produced very similar output for the error when certain ranks do not join one collective. This is because we didn't put all ranks into `candidate_ranks`, so they didn't get wiped out from the entries and got checked again.

Ideally, for the given case, we should report this as an out-of-order case, because ranks 0 and 1 call all-to-all while all the other ranks call all-gather-base. But when we select entries to compare, we don't have a global view of the entries.

In this specific case, ranks 0 and 1 have a collective of PG 7 at entry 1130 with seq ID = 1130, while the other ranks have a collective of PG 0 at entry 1130 with seq ID = 2. It's hard to match on the entry index because, once we later consider P2P, that assumption collapses, so for now we defer this to users or to further steps down the debugging stream to figure out. To make the message clearer, I also include both the seq ID and the record_id (aka entry index) in the message. (That does not mean this is impossible to implement in code; for example, we could subtract from each record_id the maximum P2P seq ID before it. But users can easily spot the wrong order themselves, so we don't think that logic is necessary now.)

P1626755348

Differential Revision: [D63815335](https://our.internmc.facebook.com/intern/diff/D63815335/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137256
Approved by: https://github.com/c-p-i-o
2024-10-03 18:06:45 +00:00
9372692c7b [FR] Make OSS fr_trace function available for internal script and improve pg filtering (#136473)
Differential Revision: [D63287384](https://our.internmc.facebook.com/intern/diff/D63287384/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136473
Approved by: https://github.com/c-p-i-o
2024-09-24 02:34:43 +00:00
da51fe1c42 [FR] Fix errors in all2all check, improve some log output (#136399)
We found that we show the hashed PG name in our script output, which is not UX friendly.
We also found a bug in our all2all check, and we made a bunch of changes to the error messages to make them more readable.

Differential Revision: [D63206469](https://our.internmc.facebook.com/intern/diff/D63206469)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136399
Approved by: https://github.com/c-p-i-o
2024-09-23 16:31:31 +00:00
dc0e818738 [FR] Automatically infer a common filename prefix (#135158)
Saves the annoyance of specifying this on the command line each time.
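
A minimal sketch of one way to infer the prefix from the dump directory's filenames (illustrative; the tool's actual inference may differ):
```
import os

files = ["nccl_trace_rank_0", "nccl_trace_rank_1", "nccl_trace_rank_12"]
prefix = os.path.commonprefix(files)
print(prefix)  # nccl_trace_rank_
```
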
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135158
Approved by: https://github.com/fduwjj, https://github.com/c-p-i-o
ghstack dependencies: #135157
2024-09-06 21:44:27 +00:00
06e414d7fe [FR] Make trace_dir a required argument (#135157)
Ensures users get a clean error if they forget to specify the dir, and
improves the help message.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135157
Approved by: https://github.com/c-p-i-o, https://github.com/fduwjj
2024-09-06 21:44:27 +00:00
e020a8755a [Fix][FR][ez] Remove debugging logs (#135308)
Removes the print added during the debugging process.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135308
Approved by: https://github.com/wz337
2024-09-06 06:14:33 +00:00
4a661e089a [FR] Add version based logic to FR script and make traces print can be filtered (#135154)
This PR passes the version around so that we can have different behaviors for different versions of the FR dump. It also adds logic to filter the printed traces to certain PGs (by desc) and ranks.

Some minor refactors to make names more accurate and to get a util function working.

<img width="1180" alt="image" src="https://github.com/user-attachments/assets/4ef8a2d6-1296-4a45-b9a7-6d3b48fbe233">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135154
Approved by: https://github.com/wconstab
2024-09-05 00:59:32 +00:00
1993a2aa9e [FR] Make pg_name unique, show P2P collective status and fix bugs when running the script as command (#134780)
Fixes a bunch of bugs in the script when running with the generated command and 3D parallelism.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134780
Approved by: https://github.com/c-p-i-o
ghstack dependencies: #134528
2024-08-30 18:03:17 +00:00