19 Commits

Author SHA1 Message Date
a69785b3ec [BE] fix typos in tools/ (#156082)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156082
Approved by: https://github.com/soulitzer
ghstack dependencies: #156079
2025-06-17 19:25:50 +00:00
95abc0f515 [c10d][fr] Fix another bug when we should continue when the op list is empty (#151798)
Differential Revision: D73375318

We shouldn't check the op list when it is empty. Later, when it is empty and we pop it from the queue, we check for collective matching. Added a unit test for this case, which also covers the case fixed in https://github.com/pytorch/pytorch/pull/151683.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151798
Approved by: https://github.com/d4l3k, https://github.com/wconstab, https://github.com/fegin
2025-04-22 04:43:31 +00:00
6e7b6e8d57 [c10d][fr] Fix a bug when first rank is not zero in the script (#151683)
Summary: While further testing the script, we found that we shouldn't always assume rank 0 is the first rank, so we need to check all entries to see whether there is a P2P op in this coalesced group.
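
A minimal sketch of the idea, not the script's actual code: the `entries_by_rank` mapping and the `is_p2p` field are hypothetical names chosen for illustration. The point is to scan every rank's entries rather than index rank 0 directly.

```python
# Hypothetical sketch: the entry layout and field names are assumptions.
def group_has_p2p(entries_by_rank: dict[int, list[dict]]) -> bool:
    """Return True if any rank recorded a P2P op for this coalesced group."""
    # Do not index entries_by_rank[0]: the first rank in a group may not be 0.
    return any(
        entry.get("is_p2p", False)
        for entries in entries_by_rank.values()
        for entry in entries
    )


# Ranks 2 and 3 form the group; rank 0 is absent.
entries = {
    2: [{"op": "send", "is_p2p": True}],
    3: [{"op": "recv", "is_p2p": True}],
}
has_p2p = group_has_p2p(entries)
```

Looking up `entries[0]` here would raise a `KeyError`, which is the kind of corner case the summary describes.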

Test Plan: Directly test with corner case.

Differential Revision: D73266257

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151683
Approved by: https://github.com/fegin
2025-04-18 20:55:06 +00:00
6f9ffaa991 [c10d][fr] Fix script for uneven reduce scatter and update test cases (#151475)
Somehow the type string for reduce scatter is "REDUCE_SCATTER" not "REDUCESCATTER". This PR fixed it and added more test cases.

Differential Revision: [D73141245](https://our.internmc.facebook.com/intern/diff/D73141245)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151475
Approved by: https://github.com/fegin
2025-04-17 02:11:08 +00:00
ae648f047c [c10d][fr] Enable FR analysis script for rest of all coalesce op (#151247)
We revisited how coalesced collectives work in https://github.com/pytorch/pytorch/pull/151243, and we now want to enable the script for the slow path. The change is BC-breaking, but it is needed to make this work, and the API is for internal use only; it is not user-facing. For the slow path, each individual collective has input sizes and output sizes recorded but no state; only the final one has the state ready. We check the correctness of each individual collective one by one, but we don't check the state match for them. We can only check the state match for the last one, which is the work item with the coalesced label.

Added more unit tests for the slow path.
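
The checking order described above could look roughly like the following sketch. The `Op` class, its fields, and the error strings are assumptions made for illustration and are not taken from the analysis script.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Op:
    # Hypothetical record: individual ops carry sizes only.
    input_sizes: list
    output_sizes: list
    state: Optional[str] = None  # set only on the final coalesced work item


def check_coalesced(rank_a: list[Op], rank_b: list[Op]) -> list[str]:
    """Compare two ranks' ops for one coalesced group and collect mismatches."""
    if len(rank_a) != len(rank_b):
        return ["op count mismatch"]
    errors = []
    # Sizes are checked for every individual collective, one by one.
    for i, (a, b) in enumerate(zip(rank_a, rank_b)):
        if a.input_sizes != b.input_sizes or a.output_sizes != b.output_sizes:
            errors.append(f"size mismatch at op {i}")
    # State is only meaningful on the last item, so compare it there alone.
    if rank_a and rank_a[-1].state != rank_b[-1].state:
        errors.append("state mismatch on coalesced work item")
    return errors
```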

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151247
Approved by: https://github.com/d4l3k, https://github.com/XilunWu
2025-04-15 20:53:03 +00:00
8aaf296efc [c10d][fr] Refactor analysis script for modularization and reusing for coalesce collectives (#150881)
We are trying to make the FR analysis code more reusable and modular, so we split the core error analysis logic into separate functions.

This PR mostly shuffles the code around a bit.

Differential Revision: [D72690120](https://our.internmc.facebook.com/intern/diff/D72690120)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150881
Approved by: https://github.com/wz337
2025-04-09 16:10:19 +00:00
c73a92fbf5 [BE][CI] bump ruff to 0.9.2: multiline assert statements (#144546)
Reference: https://docs.astral.sh/ruff/formatter/black/#assert-statements

> Unlike Black, Ruff prefers breaking the message over breaking the assertion, similar to how both Ruff and Black prefer breaking the assignment value over breaking the assignment target:
>
> ```python
> # Input
> assert (
>     len(policy_types) >= priority + num_duplicates
> ), f"This tests needs at least {priority+num_duplicates} many types."
>
>
> # Black
> assert (
>     len(policy_types) >= priority + num_duplicates
> ), f"This tests needs at least {priority+num_duplicates} many types."
>
> # Ruff
> assert len(policy_types) >= priority + num_duplicates, (
>     f"This tests needs at least {priority + num_duplicates} many types."
> )
> ```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144546
Approved by: https://github.com/malfet
2025-02-27 20:46:16 +00:00
fb55bac3de [fr][fix] Split MatchState and dynamic info for fr analysis downstream (#147439)
The original MatchState type was declared as a Python Enum. Although we made it callable, we consume it right away. There are downstream cases where we need it to be a Python class, which a Python enum does not support. So we did a small refactoring to keep both the enum state and the dynamic info (culprit) for the FR analysis script.
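
One way to picture the split is the sketch below. Only the name `MatchState` and the notion of a culprit come from the description; the enum members, the `MatchInfo` wrapper, and its `__str__` format are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum, auto


class MatchState(Enum):
    # Member names are assumptions for illustration.
    MATCHED = auto()
    SIZE_MISMATCH = auto()


@dataclass
class MatchInfo:
    """Hypothetical wrapper: a fixed enum state plus per-instance detail."""

    state: MatchState
    culprit: str = ""

    def __str__(self) -> str:
        if self.culprit:
            return f"{self.state.name}: {self.culprit}"
        return self.state.name
```

Enum members are singletons, so per-call data such as a culprit cannot live on them safely; a small class that holds the enum value alongside that data avoids the problem.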

Differential Revision: [D69830994](https://our.internmc.facebook.com/intern/diff/D69830994)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147439
Approved by: https://github.com/fegin
2025-02-19 22:09:16 +00:00
07669ed960 PEP585 update - benchmarks tools torchgen (#145101)
This is one of a series of PRs to update us to PEP585 (changing Dict -> dict, List -> list, etc).  Most of the PRs were completely automated with RUFF as follows:

Since RUFF UP006 is considered an "unsafe" fix, we first need to enable unsafe fixes:

```
--- a/tools/linter/adapters/ruff_linter.py
+++ b/tools/linter/adapters/ruff_linter.py
@@ -313,6 +313,7 @@
                     "ruff",
                     "check",
                     "--fix-only",
+                    "--unsafe-fixes",
                     "--exit-zero",
                     *([f"--config={config}"] if config else []),
                     "--stdin-filename",
```

Then we need to tell RUFF to allow UP006 (as a final PR once all of these have landed this will be made permanent):

```
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -40,7 +40,7 @@

 [tool.ruff]
-target-version = "py38"
+target-version = "py39"
 line-length = 88
 src = ["caffe2", "torch", "torchgen", "functorch", "test"]

@@ -87,7 +87,6 @@
     "SIM116", # Disable Use a dictionary instead of consecutive `if` statements
     "SIM117",
     "SIM118",
-    "UP006", # keep-runtime-typing
     "UP007", # keep-runtime-typing
 ]
 select = [
```

Finally running `lintrunner -a --take RUFF` will fix up the deprecated uses.
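
For illustration, this is the kind of rewrite the PEP 585 update performs. The function names below are hypothetical and do not come from the PyTorch tree; the example needs Python 3.9 or later.

```python
from typing import Dict, List  # typing aliases that PEP 585 deprecates


def count_old(words: List[str]) -> Dict[str, int]:
    """Before: annotations use the typing aliases."""
    counts: Dict[str, int] = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts


def count_new(words: list[str]) -> dict[str, int]:
    """After: annotations use the builtin generics directly."""
    counts: dict[str, int] = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts
```

Both functions behave identically at runtime; only the annotations differ.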

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145101
Approved by: https://github.com/bobrenjc93
2025-01-18 05:05:07 +00:00
47446cb5f3 [fr][c10d] move logger out from utils.py (#139806)
Summary:
Move flight recorder logger class out from utils.py into its own file.
This makes the program more modular.
This is mostly a refactoring/non-functional change.

Test Plan:
Built fr_trace locally and ran it.
```
buck build //caffe2/fb/flight_recorder:fr_trace
Buck UI: https://www.internalfb.com/buck2/875ca6a3-e86e-4263-95a0-579502494c5c
Network: Up: 0B  Down: 0B
Jobs completed: 6818. Time elapsed: 0.2s.
BUILD SUCCEEDED
```
Ran it as follows:
```
cd buck-out/v2/gen/fbcode/caffe2/fb/flight_recorder

./fr_trace.par  -p trace_ /tmp
Not all ranks joining collective 3 at entry 2
group info: 0:default_pg
collective: nccl:all_reduce
missing ranks: {1}
input sizes: [[4, 5]]
output sizes: [[4, 5]]
expected ranks: 2
collective state: scheduled
collective stack trace:
 <module> at /home/cpio/test/c.py:66
```

Differential Revision: D65503768

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139806
Approved by: https://github.com/fduwjj
2024-11-07 01:44:12 +00:00
d549ddfb14 [fr][rfc] use a logger to control output for flight recorder analyzer (#139656)
Summary: Use a logger to control output to the console. This is useful for hiding debug/detail messages from the console vs. showing everything together.
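
A minimal sketch of this approach using the standard `logging` module. The logger name, the `make_logger` helper, and the choice of DEBUG versus INFO levels are assumptions, not the analyzer's actual implementation.

```python
import logging


def make_logger(verbose: bool) -> logging.Logger:
    """Build a console logger; detail messages appear only when verbose."""
    logger = logging.getLogger("fr_sketch")  # hypothetical logger name
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG if verbose else logging.INFO)
    logger.propagate = False
    return logger


logger = make_logger(verbose=False)
logger.debug("built groups, memberships")  # suppressed without verbose
logger.info("Not all ranks joining collective 3 at entry 2")  # always shown
```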

Test Plan:
Ran `torchfrtrace` with various switches.

The `-v` verbose switch
```
torchfrtrace --prefix "trace_" /tmp/ -v
loaded 2 files in 0.2567298412322998s
built groups, memberships
Not all ranks joining collective 3 at entry 2
group info: 0:default_pg
collective: nccl:all_reduce
missing ranks: {1}
input sizes: [[4, 5]]
output sizes: [[4, 5]]
expected ranks: 2
collective state: scheduled
collective stack trace:
 <module> at /home/cpio/test/c.py:66
appending a non-matching collective
built collectives, nccl_calls
Groups
                  id  desc          size
--------------------  ----------  ------
09000494312501845833  default_pg       2
Memberships
            group_id    global_rank
--------------------  -------------
09000494312501845833              0
09000494312501845833              1
Collectives
  id    group_id
----  ----------
   0           0
   1           0
NCCLCalls
  id    collective_id    group_id    global_rank    traceback_id  collective_type    sizes
----  ---------------  ----------  -------------  --------------  -----------------  --------
   0                0           0              0               0  nccl:all_reduce    [[3, 4]]
   1                0           0              1               0  nccl:all_reduce    [[3, 4]]
   2                1           0              0               0  nccl:all_reduce    [[3, 4]]
   3                1           0              1               0  nccl:all_reduce    [[3, 4]]
   4                            0              0               0  nccl:all_reduce    [[4, 5]]
```

Without the verbose switch
```
❯ torchfrtrace --prefix "trace_" /tmp/
Not all ranks joining collective 3 at entry 2
group info: 0:default_pg
collective: nccl:all_reduce
missing ranks: {1}
input sizes: [[4, 5]]
output sizes: [[4, 5]]
expected ranks: 2
collective state: scheduled
collective stack trace:
 <module> at /home/cpio/test/c.py:66
```

With the `-j` switch:
```
❯ torchfrtrace --prefix "trace_" /tmp/ -j
Rank 0                                             Rank 1
-------------------------------------------------  -------------------------------------------------
all_reduce(input_sizes=[[3, 4]], state=completed)  all_reduce(input_sizes=[[3, 4]], state=completed)
all_reduce(input_sizes=[[3, 4]], state=completed)  all_reduce(input_sizes=[[3, 4]], state=completed)
all_reduce(input_sizes=[[4, 5]], state=scheduled)
```

Differential Revision: D65438520

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139656
Approved by: https://github.com/fduwjj
2024-11-05 20:14:18 +00:00
267f82b860 [BE] Format .ci/ / .github/ / benchmarks/ / functorch/ / tools/ / torchgen/ with ruff format (#132577)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132577
Approved by: https://github.com/malfet
2024-10-11 18:30:26 +00:00
9372692c7b [FR] Make OSS fr_trace function available for internal script and improve pg filtering (#136473)
Differential Revision: [D63287384](https://our.internmc.facebook.com/intern/diff/D63287384/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136473
Approved by: https://github.com/c-p-i-o
2024-09-24 02:34:43 +00:00
da51fe1c42 [FR] Fix errors in all2all check, improve some log output (#136399)
We found that we show the hashed pg name in our script output, which is not user-friendly.
We also found a bug in our all2all check, and we made a number of changes to the error messages to make them more readable.

Differential Revision: [D63206469](https://our.internmc.facebook.com/intern/diff/D63206469)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136399
Approved by: https://github.com/c-p-i-o
2024-09-23 16:31:31 +00:00
4a661e089a [FR] Add version based logic to FR script and make traces print can be filtered (#135154)
This PR passes the version around so that we can have different behaviors for different versions of the FR dump. It also adds logic for filtering the printed traces to certain PGs (by desc) and ranks.

It also includes some minor refactors to make names more accurate and to get a util function working.

<img width="1180" alt="image" src="https://github.com/user-attachments/assets/4ef8a2d6-1296-4a45-b9a7-6d3b48fbe233">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135154
Approved by: https://github.com/wconstab
2024-09-05 00:59:32 +00:00
1993a2aa9e [FR] Make pg_name unique, show P2P collective status and fix bugs when running the script as command (#134780)
Fixes a bunch of bugs in the script when running with the generated command and 3D parallel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134780
Approved by: https://github.com/c-p-i-o
ghstack dependencies: #134528
2024-08-30 18:03:17 +00:00
4c16797e71 [c10d FR analyzer] Output a meaningful debug report for users (#134528)
- This PR generates a more useful output log for users: P1552399180.
- It also fixes the logic when we check the all-gather size mismatch.
- Add dtype check for collective input/output
- We store more context information for error match_state so that we can report it in the file.
- Disable the size match for alltoall because we don't log the size for all inputs/outputs.
- Correct some types for func args specification.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134528
Approved by: https://github.com/c-p-i-o
2024-08-28 21:22:47 +00:00
bf5c7bf06d [FR] Fix the bug in FR script (e.g., checking all ranks dump check) (#134383)
We somehow converted the rank to a string, which made the ranks check fail. This fix converts them all to int.
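
The failure mode can be reproduced in a few lines. The variable names are hypothetical; the sketch only shows why a comparison between string ranks and integer ranks fails.

```python
# Hypothetical sketch: variable names are assumptions, not the script's code.
expected_ranks = set(range(2))  # {0, 1}
dumped_ranks_raw = ["0", "1"]  # ranks that arrived as strings

# Strings never compare equal to ints, so the check reports missing ranks.
broken = set(dumped_ranks_raw) == expected_ranks

# Normalizing every rank to int makes the comparison meaningful.
fixed = {int(rank) for rank in dumped_ranks_raw} == expected_ranks
```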

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134383
Approved by: https://github.com/c-p-i-o
2024-08-26 08:21:14 +00:00
8301add833 [4/N] Further refactor FR script to make it more modulized (#134196)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134196
Approved by: https://github.com/c-p-i-o
2024-08-23 01:15:29 +00:00