In reality we found the current mismatch order does not match the actual error distribution, so we reorder it a bit as following:
1. We do collective type check first
2. Then size check (excluding all2all)
3. dtype check
4. state check
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164606
Approved by: https://github.com/VieEeEw
As title, in practice we found that sometimes, the dtype of gather does not match when it comes to output among all ranks, which is a undefined behavior. Same with broadcast and scatter. And they are all completed, so we should not think they are errors, we can skip it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163839
Approved by: https://github.com/VieEeEw
Summary:
Mismatch tail is used as a fixed variable and there are cases that there are more than 10 mismatches FR gives up producing results (e.g. https://fburl.com/ai_infra/7gjl5ucb). This diff added the mismatch tail in the parsed args so make this configuarble.
Also tho the variable name is `mismatch_tail`(last 10) it is used as `mismatch_head` (the first 10). Updated it to be `num_mismatch_to_print`
Test Plan:
`buck2 run @//mode/opt //caffe2/fb/flight_recorder:fr_trace -- --mast_job_id aps-ctx_fm_pipeline_change-1c8ea38a94 --mast_job_version 0 --mast_job_attempt 2 --bucket tlcm_log_blob --world_size 128 --dump_file_name_offset 0 --allow-incomplete-ranks --num_mismatch_to_print 20 1>out 2>err`
Confirm no error and output 20 mismatches.
Rollback Plan:
Differential Revision: D82335995
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162991
Approved by: https://github.com/fduwjj
Summary: For training jobs particularly from GenAI, NCCL trace dumps are generated in the format of `<hostname>.pci3_rank_<rank>`. For multi-node training jobs, the hostname varies across traces. The current prefix matching logic can't handle this case.
Test Plan:
Create a local folder `dumps` and several empty files: `host0.pci3_rank_0`, `host0.pci3_rank_1`, `host1.pci3_rank_0`, `host1.pci3_rank_1` inside it. Then run
```
buck2 run fbcode//caffe2/fb/flight_recorder:fr_trace -- trace_dir dumps
```
Before this diff, fr_trace cannot locate any trace files, giving the following assertion error:
```
AssertionError: no files loaded from /home/tianhaoh/dumps with prefix pci3_rank_
```
After this diff, fr_trace is able to locate the trace files, resulting in the exceptions like
```
dump = pickle.load(infile)
^^^^^^^^^^^^^^^^^^^
EOFError: Ran out of input
```
(since the trace files are fake and empty).
Rollback Plan:
Differential Revision: D79224727
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159490
Approved by: https://github.com/fduwjj
Summary: We run into an interesting case when we see so many mismatches while lot of mismatch turns out to be a fully match. The reason is that we use the dump ranks (which is from 0 to 79) to compare against the local pg ranks (0 to 7) this leads to false positive of mismatches. We can just check whether dump ranks contain all expected ranks or not, that should be sufficient.
Test Plan:
Test with the failed case with the script and we now see the correct behavior + new unit test case.
Rollback Plan:
Differential Revision: D76775373
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156156
Approved by: https://github.com/VieEeEw
Summary: Further testing the script, we found that we shouldn't always assume rank 0 is the first rank, so we need to check all entries and see if it P2P op for this coalesced group.
Test Plan: Directly test with corner case.
Differential Revision: D73266257
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151683
Approved by: https://github.com/fegin
We revisited how coalesced collective is working in https://github.com/pytorch/pytorch/pull/151243 and we now want to enable the script to work for slow path. The change is indeed bc-breaking but this is needed to make it work and the API is an internal use API. It is not user facing. For slow path the individual has input-sizes and output sizes recorded but no state. The final one has the state ready. We check the correctness of each individual collective one by one but we don't check the state match for these collectives, we can only check the state match for the last one which is the work item with coalesced label.
Added more unit test for slow path.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151247
Approved by: https://github.com/d4l3k, https://github.com/XilunWu
This PR is to enable FR for all coalesce ops for fast path. (batch p2p is enabled in the current script, so we will mainly focus on non-P2P ops). To explain what is fast path, let's revisit how coalesced collective is working today:
For non-P2P coalesced ops, there are are several ways to call it (due to legendary reasons):
- Way one: Directly call python api like all_reduce_coalesced in python, this will be deprecated soon.
- Way two: Directly call api inside PGNCCL like allreduce_coalesced. The way case 1 will eventually call into this. This is not deprecated and will not be deprecated, IIUC.
- Way three: Using _coalescing_manager in python, like:
```
with _coalescing_manager():
for i in range(num_colls):
dist.all_reduce(tensors[i])
```
This way has two path:
- Fast path: when users call all-reduce, all-gather-into-tensor or reduce-scatter, we will only launch one big collective by calling the api from case 1.
- Slow path: we call startCoalescing() in the beginning and then a bunch of collectives (each one will generate a FR entry) and then endCoalescing(). Inside startCoalescing(), groupStart() is called and inside endCoalescing(), groupEnd() is then called. So although this is going to be one collective, we call into PGNCCL for each collective coalesced in the slow path case.
- For uneven all-gather (allgather_v) and reduce-scatter, it follows the pattern mention in slow path. It directly call cpp api inside PGNCCL.
This PR addressed the fast path because this is just an easy case, we store the collectives info on the python side, and we will only call into PGNCCL once so there will only be one work and one FR entry. We can just treat them as regular coalesced collective.
We add some e2e unit test for build_db function so that the change to FR is more thoroughly tested.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151243
Approved by: https://github.com/d4l3k, https://github.com/wz337
Reference: https://docs.astral.sh/ruff/formatter/black/#assert-statements
> Unlike Black, Ruff prefers breaking the message over breaking the assertion, similar to how both Ruff and Black prefer breaking the assignment value over breaking the assignment target:
>
> ```python
> # Input
> assert (
> len(policy_types) >= priority + num_duplicates
> ), f"This tests needs at least {priority+num_duplicates} many types."
>
>
> # Black
> assert (
> len(policy_types) >= priority + num_duplicates
> ), f"This tests needs at least {priority+num_duplicates} many types."
>
> # Ruff
> assert len(policy_types) >= priority + num_duplicates, (
> f"This tests needs at least {priority + num_duplicates} many types."
> )
> ```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144546
Approved by: https://github.com/malfet
The original MatchState type was declared as a python Enum. Although we did make it callable but we consume it right away. There are downstream cases when we need it to be a python class which is not supported in Python enum. So we did a small refactoring so that we keep both the enum state and dynamic info (culprit) for the fr analysis script.
Differential Revision: [D69830994](https://our.internmc.facebook.com/intern/diff/D69830994)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147439
Approved by: https://github.com/fegin
This is one of a series of PRs to update us to PEP585 (changing Dict -> dict, List -> list, etc). Most of the PRs were completely automated with RUFF as follows:
Since RUFF UP006 is considered an "unsafe" fix first we need to enable unsafe fixes:
```
--- a/tools/linter/adapters/ruff_linter.py
+++ b/tools/linter/adapters/ruff_linter.py
@@ -313,6 +313,7 @@
"ruff",
"check",
"--fix-only",
+ "--unsafe-fixes",
"--exit-zero",
*([f"--config={config}"] if config else []),
"--stdin-filename",
```
Then we need to tell RUFF to allow UP006 (as a final PR once all of these have landed this will be made permanent):
```
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -40,7 +40,7 @@
[tool.ruff]
-target-version = "py38"
+target-version = "py39"
line-length = 88
src = ["caffe2", "torch", "torchgen", "functorch", "test"]
@@ -87,7 +87,6 @@
"SIM116", # Disable Use a dictionary instead of consecutive `if` statements
"SIM117",
"SIM118",
- "UP006", # keep-runtime-typing
"UP007", # keep-runtime-typing
]
select = [
```
Finally running `lintrunner -a --take RUFF` will fix up the deprecated uses.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145101
Approved by: https://github.com/bobrenjc93
Summary:
D67068632 introduced a better profiling name for barrier operations to be able to distinguish various ops.
Unfortunately, this broke Flight Recorder Analysis with the following error as reported by dmwu
```
fr_trace -m torchx-param_bench_16g_mi300x-all_to_all -a 0 --mast_job_version 98 -w 16
Traceback (most recent call last):
File "/usr/local/fbcode/platform010/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/local/fbcode/platform010/lib/python3.10/runpy.py", line 86, in _run_code
```
Test Plan: Test manually.
Differential Revision: D67305997
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143354
Approved by: https://github.com/wconstab
Summary: We want to return the mismatch ranks info in the `mismatch_collectives` field. Also update the logging message when no error is found and it's not partial analysis.
Test Plan: CI
Differential Revision: D66522602
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141631
Approved by: https://github.com/c-p-i-o
Summary: Without this change calling `str(MatchState.SOMETHING)` will cause exception.
Test Plan:
Can we add unittest somewhere?
Ensure `str(MatchState.FULLY_MATCHED)` and `str(MatchState.FULLY_MATCHED())` won't raise exception.
Differential Revision: D66321609
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141297
Approved by: https://github.com/fduwjj
Summary:
1. Clearly specify error messages that we are refering to a collective_sequence_id and an internal_record id for entry.
The entry id is semi-useless for the end consumer so at least let them know that this is an internal record id.
2. Add some missing fields in types.py.
self.missing_ranks = set()
self.input_numel = tuple()
self.output_numel = tuple()
self.errors = set()
These were showing up as linter errors when I opened the file in vs-code
Test Plan:
```
buck2 run //caffe2/fb/flight_recorder:fr_trace -- -m f665492593-nerf_training-96ab95e0 -w 8 --mast_job_version 0 -a 0
Buck UI: https://www.internalfb.com/buck2/2cac9273-1b7b-47bf-867f-82f9a4c1d581
Network: Up: 0B Down: 0B
Not all ranks joining collective: sequence number: 31117
internal record id: 31116
group info: 0:default_pg
collective: nccl:all_reduce
missing ranks: {3, 4, 5, 6, 7}
input sizes: [[1571911]]
output sizes: [[1571911]]
world size: 8
expected ranks: {0, 1, 2, 3, 4, 5, 6, 7}
collective state: scheduled
collective stack trace:
all_reduce at /packages/fblearner.flow.canary/workflow#link-tree/torch/distributed/distributed_c10d.py:2707
wrapper at /packages/fblearner.flow.canary/workflow#link-tree/torch/distributed/c10d_logger.py:81
sync_buffers at /packages/fblearner.flow.canary/workflow#link-tree/xri_mapsr/neural_fields/models/gaussian_splatting.py:650
decorate_context at /packages/fblearner.flow.canary/workflow#link-tree/torch/utils/_contextlib.py:116
step at /packages/fblearner.flow.canary/workflow#link-tree/xri_mapsr/neural_fields/training/training_manager/splatting.py:356
main at /packages/fblearner.flow.canary/workflow#link-tree/xri_mapsr/neural_fields/nerf_training.py:260
main_impl at /packages/fblearner.flow.canary/workflow#link-tree/rl_aiep/mast/endpoint.py:57
main at /packages/fblearner.flow.canary/workflow#link-tree/rl_aiep/mast/endpoint.py:34
wrapper at /packages/fblearner.flow.canary/workflow#link-tree/torch/distributed/elastic/multiprocessing/errors/__init__.py:355
<module> at /packages/fblearner.flow.canary/workflow#link-tree/rl_aiep/mast/endpoint.py:118
_run_code at /packages/fblearner.flow.canary/workflow#link-tree/runtime/lib/python3.10/runpy.py:86
_run_module_as_main at /packages/fblearner.flow.canary/workflow#link-tree/runtime/lib/python3.10/runpy.py:196
run_as_main at /packages/fblearner.flow.canary/workflow#link-tree/__par__/bootstrap.py:69
run_as_main at /packages/fblearner.flow.canary/workflow#link-tree/__par__/meta_only/bootstrap.py:98
__invoke_main at /packages/fblearner.flow.canary/workflow#link-tree/__run_lpar_main__.py:28
<module> at /packages/fblearner.flow.canary/workflow#link-tree/__run_lpar_main__.py:31
...
Differential Revision: D66018461
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140969
Approved by: https://github.com/Skylion007, https://github.com/fduwjj
Summary:
1. We don't want to exit with exceptions when there are so many mismatches. We should just break and return.
2. Polish the message of dtype mismatch. This is because dtype of input/output is actually a list not a string. So we don't want to show a list of ['double'] in the output message.
Test Plan:
Testing on the case when we see too many collective dtype mismatch
{F1958467224}
Differential Revision: D65841830
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140451
Approved by: https://github.com/c-p-i-o
Summary:
Move flight recorder logger class out from utils.py into its own file.
This makes the program more modular.
This is mostly a refactoring/non-functional change.
Test Plan:
Build fr_trace locally and ran it.
```
buck build //caffe2/fb/flight_recorder:fr_trace
Buck UI: https://www.internalfb.com/buck2/875ca6a3-e86e-4263-95a0-579502494c5c
Network: Up: 0B Down: 0B
Jobs completed: 6818. Time elapsed: 0.2s.
BUILD SUCCEEDED
```
Ran it as follows:
```
cd buck-out/v2/gen/fbcode/caffe2/fb/flight_recorder
./fr_trace.par -p trace_ /tmp
Not all ranks joining collective 3 at entry 2
group info: 0:default_pg
collective: nccl:all_reduce
missing ranks: {1}
input sizes: [[4, 5]]
output sizes: [[4, 5]]
expected ranks: 2
collective state: scheduled
collective stack trace:
<module> at /home/cpio/test/c.py:66
```
Differential Revision: D65503768
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139806
Approved by: https://github.com/fduwjj
Summary:
Move check_no_missing_dump_files to after the "just print" location.
This allows us to print dump_files when there are actual missing files.
Test Plan:
```
torchfrtrace -j ~/pyper-training-online-924394600 --selected-ranks 1 2
Inferred common prefix nccl_trace_rank_
loaded 95 files in 0.040270328521728516s
built groups, memberships
Rank 1 Rank 2
------------------------------------------------------------------ ------------------------------------------------------------------
broadcast(input_sizes=[[2]], state=completed) broadcast(input_sizes=[[2]], state=completed)
```
Without this change, the command was erroring out.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139417
Approved by: https://github.com/Skylion007, https://github.com/fduwjj
As title, when testing on an internal case, we found that we have very similar output for the error when certain ranks does not join one collective. This is because we didn't put all ranks into `candidate_ranks` so that they didn't get wiped out from entries and gets checked again.
Ideally for the given case, we should report this is an out of order case, because rank 0, 1 calls all-to-all while all the rest ranks call all-gather-base. But when we select entries to compare, we don't have global view of the entries.
In the specific case, on rank 0 and 1, it has collective of PG 7 on entry 1130 with seq ID = 1130. However, on other ranks, they have collective of PG 0 on entry 1130 with seq ID = 2. It's hard to use entry idx to do the match because if we later consider p2p, this assumption will collapse, so we now still defer it for users or further down debugging stream to figure it out. To make the message clearer, I also include both seqID and record_id (aka, entry index) in the message. (That does not mean this is not possible to implement in the code, for example, we can let all record_id to minus the maximum p2p seq id before it; but users will easily see the wrong order, so we don't think it's necessary to have that logic now)
P1626755348
Differential Revision: [D63815335](https://our.internmc.facebook.com/intern/diff/D63815335/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137256
Approved by: https://github.com/c-p-i-o