4 Commits

Author SHA1 Message Date
61d9a5180e [Fix XPU CI] [Inductor UT] Fix test cases broken by community. (#165714)
Fixes #165719, Fixes #165771

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165714
Approved by: https://github.com/jansel
2025-10-19 23:59:04 +00:00
b3f6d49b69 Overlap scheduler improvements (#165318)
Bucketing a number of smallish improvements:

- Account for bucketing in overlap calculation: if an in-flight collective exists with the same bucket key, reduce new collectives estimated time by its latency time
-  Update compute domination so we are ordering based on compute idx, as opposed to compute depth, so we never reorder compute. this makes it a bit easier to reason about memory, and pre-fetching, although we can exploring reordering in the future.
- When we wait on a collective, force all collectives on the same process group as it that were enqueued prior to the collective to wait as well.

Better Memory Handling:
- Pre-fetch limiting - when scheduling collectives for overlap, only pre-fetch up to a certain distance, then schedule off-path collectives (which are typically memory reducing).
- When we are above peak memory, schedule waits.

TODO:
- for each compute node, we know its original memory in the graph. we could limit pre-fetching that goes across peak memory
- By scheduling off-path collectives for overlap, we reduce memory, but if there weren't enough compute for overlap, we need to proactively schedule them. not an issue yet on examples.
- config some hard coded constants, clean up enablement (can do in subsequent pr)

On small llama 2d backward :
578 of 618 potentially hideable collectives hidden
original mem 14.4GB, rescheduled mem, 15.9GB

on forward:
254/256 potentially hideable collectives hidden
original mem 5.8 gb, reshceduled mem 5.8GB

WIP: adding tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165318
Approved by: https://github.com/ezyang, https://github.com/IvanKobzarev
ghstack dependencies: #164738, #164783, #164944, #164945, #165059
2025-10-15 21:58:47 +00:00
2b71b62045 Add Memory Estimation Tracker (#165059)
Add Memory Tracker utility, which will track live memory given alternate ordering of nodes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165059
Approved by: https://github.com/ezyang, https://github.com/IvanKobzarev
ghstack dependencies: #164738, #164783, #164944, #164945
2025-10-15 19:44:29 +00:00
aed5ed1076 Refactor memory estimator to use node storages, add test (#164783)
- Update the Memory Estimator to use node storages for analysis, which simplifies book keeping, as opposed to manually looking at operator schema. This will also allow me to reuse this component elsewhere.

- Factor out into separate class, so that this same logic can be used  in scheduling (node allocations / aliasing / uses)

- Adds Tests for correctness - right now only on fwd/bwd by itself, not with both.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164783
Approved by: https://github.com/ruisizhang123
ghstack dependencies: #164738
2025-10-08 22:07:43 +00:00