pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-20 21:14:14 +08:00

Files

IvanKobzarev db44de4c0d [inductor] Estimate peak memory allocfree and applying to reordering collectives (#160113 )

1. Applying @eellison idea from https://github.com/pytorch/pytorch/pull/146562#discussion_r2059363672 for estimate_peak_memory:
```
    """
    Alternative version of estimate_peak_memory, that respects the fact,
    that every SchedulerNode has multiple phases:
    1. alloc ( outputs )
    2. run_kernel
    3. dealloc last_use buffers
    estimate_peak_memory collapses memory into one value: size_alloc - size_free
    While peak memory happens after alloc.

    Duplicating the code to not migrate all callsites at once,
    In future usages of estimate_peak_memory will migrate to this version.
    """
```

- Applying this in `reorder_communication_preserving_peak_memory` pass.

2. Buffers during reordering can change deallocation point, if candidate and group to swap both are users of the f_input_buf and group contains last_use_snode.

- Addressing this tracking the last_use_snode for each buffer and recomputing current memory respecting the change in size_free (group_node after reordering is not the last user of the buffer and its size_free -= buffer_size, while candidate becomes the last user and candidate.size_free += buffer_size).

4. Adding env var `PYTORCH_REORDER_COLLECTIVES_LIMIT` for ablation to limit number of collectives to reorder.

What is after this PR:

Iterative recomputation of memory estimations matches full memory estimations.

Active memory is not regressing a lot, but reserved memory is significantly regressed.

Investigation and fix of "reserved" memory will be in following PRs.

BASELINE (bucketing AG and RS): active: 32Gb reserved: 34Gb
```
[rank0]:[titan] 2025-08-11 11:28:36,798 - root - INFO - step:  1  loss: 12.2722  grad_norm:  4.2192  active_memory: 24.66GiB(25.96%)  reserved_memory: 25.38GiB(26.72%)  tps: 99  tflops: 5.71  mfu: 0.58%
[rank0]:[titan] 2025-08-11 11:28:38,640 - root - INFO - step:  2  loss: 13.1738  grad_norm: 50.5566  active_memory: 32.14GiB(33.83%)  reserved_memory: 34.21GiB(36.01%)  tps: 4,448  tflops: 257.63  mfu: 26.05%
[rank0]:[titan] 2025-08-11 11:28:40,029 - root - INFO - step:  3  loss: 15.6866  grad_norm: 80.0862  active_memory: 32.14GiB(33.83%)  reserved_memory: 34.21GiB(36.01%)  tps: 5,900  tflops: 341.72  mfu: 34.55%
[rank0]:[titan] 2025-08-11 11:28:41,423 - root - INFO - step:  4  loss: 13.4853  grad_norm:  7.8538  active_memory: 32.14GiB(33.83%)  reserved_memory: 34.21GiB(36.01%)  tps: 5,881  tflops: 340.57  mfu: 34.44%
[rank0]:[titan] 2025-08-11 11:28:42,820 - root - INFO - step:  5  loss: 16.1191  grad_norm: 53.2481  active_memory: 32.14GiB(33.83%)  reserved_memory: 34.21GiB(36.01%)  tps: 5,867  tflops: 339.77  mfu: 34.35%
```
REORDER: active: 32Gb reserved: 36Gb
```
[rank0]:[titan] 2025-08-11 11:34:32,772 - root - INFO - step:  1  loss: 12.2490  grad_norm:  4.1944  active_memory: 24.66GiB(25.96%)  reserved_memory: 26.81GiB(28.22%)  tps: 85  tflops: 4.90  mfu: 0.50%
[rank0]:[titan] 2025-08-11 11:34:35,329 - root - INFO - step:  2  loss: 13.1427  grad_norm: 39.5942  active_memory: 32.14GiB(33.83%)  reserved_memory: 36.40GiB(38.31%)  tps: 3,205  tflops: 185.61  mfu: 18.77%
[rank0]:[titan] 2025-08-11 11:34:36,770 - root - INFO - step:  3  loss: 14.6084  grad_norm: 51.0743  active_memory: 32.14GiB(33.83%)  reserved_memory: 36.40GiB(38.31%)  tps: 5,688  tflops: 329.44  mfu: 33.31%
[rank0]:[titan] 2025-08-11 11:34:38,197 - root - INFO - step:  4  loss: 13.6181  grad_norm:  8.1122  active_memory: 32.14GiB(33.83%)  reserved_memory: 36.40GiB(38.31%)  tps: 5,744  tflops: 332.68  mfu: 33.64%
[rank0]:[titan] 2025-08-11 11:34:39,821 - root - INFO - step:  5  loss: 15.8913  grad_norm: 59.8510  active_memory: 32.14GiB(33.83%)  reserved_memory: 36.40GiB(38.31%)  tps: 5,046  tflops: 292.22  mfu: 29.55%
```

REORDER + SINK_WAITS_ITERATIVE: active: 35Gb reserved: 41Gb
```
[rank0]:[titan] 2025-08-11 11:31:36,119 - root - INFO - step:  1  loss: 12.2646  grad_norm:  4.1282  active_memory: 27.60GiB(29.05%)  reserved_memory: 32.49GiB(34.20%)  tps: 173  tflops: 10.00  mfu: 1.01%
[rank0]:[titan] 2025-08-11 11:31:37,452 - root - INFO - step:  2  loss: 13.2353  grad_norm: 42.4234  active_memory: 35.08GiB(36.92%)  reserved_memory: 41.62GiB(43.80%)  tps: 6,152  tflops: 356.26  mfu: 36.02%
[rank0]:[titan] 2025-08-11 11:31:38,780 - root - INFO - step:  3  loss: 13.8205  grad_norm: 24.0156  active_memory: 35.08GiB(36.92%)  reserved_memory: 41.62GiB(43.80%)  tps: 6,169  tflops: 357.29  mfu: 36.13%
[rank0]:[titan] 2025-08-11 11:31:40,106 - root - INFO - step:  4  loss: 13.1033  grad_norm:  9.1167  active_memory: 35.08GiB(36.92%)  reserved_memory: 41.62GiB(43.80%)  tps: 6,183  tflops: 358.10  mfu: 36.21%
[rank0]:[titan] 2025-08-11 11:31:41,443 - root - INFO - step:  5  loss: 16.3530  grad_norm: 51.8118  active_memory: 35.08GiB(36.92%)  reserved_memory: 41.62GiB(43.80%)  tps: 6,130  tflops: 355.03  mfu: 35.90%
```

Differential Revision: [D80718143](https://our.internmc.facebook.com/intern/diff/D80718143)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160113
Approved by: https://github.com/wconstab, https://github.com/eellison

Co-authored-by: eellison <elias.ellison@gmail.com>

2025-08-22 14:19:57 +00:00

_awaits

…

[fr] [xpu] Add FlightRecorder support for ProcessGroupXCCL (#158568 )

2025-08-22 09:03:35 +00:00

_C_flatbuffer

…

_custom_op

[BE]: ruff PLC0207 - use maxsplit kwarg (#160107 )

2025-08-08 03:14:59 +00:00

_decomp

[dynamic shapes] unbacked-safe slicing (#157944 )

2025-08-20 22:52:56 +00:00

_dispatch

Improve torch.ops typing (#154555 )

2025-06-22 15:52:27 +00:00

_dynamo

[BE] [dynamo] Simplify two methods in ConstDictVariable (#159361 )

2025-08-22 11:11:30 +00:00

_export

[export] Remove unused Model, tensor_paths, constant_paths (#161185 )

2025-08-22 01:07:01 +00:00

_functorch

[invoke_subgraph][inductor] Thread graphsafe rng input states for hops (#160713 )

2025-08-21 20:41:29 +00:00

_higher_order_ops

[invoke_subgraph][inductor] Thread graphsafe rng input states for hops (#160713 )

2025-08-21 20:41:29 +00:00

_inductor

[inductor] Estimate peak memory allocfree and applying to reordering collectives (#160113 )

2025-08-22 14:19:57 +00:00

_lazy

[BE][2/16] fix typos in torch/ (torch/_*/) (#156312 )

2025-07-12 05:47:06 +00:00

_library

Account for triton kernel source code hidden in custom ops properly in AOTAutogradCache (#160120 )

2025-08-12 14:11:06 +00:00

_logging

Add compile_id: Optional[CompileID] to torch._logging._internal.trace_structured_artifact (#160440 )

2025-08-13 06:28:23 +00:00

_numpy

Fix torch._numpy to match NumPy when empty ellipsis causes advanced indexing separation (#158297 )

2025-07-16 08:11:53 +00:00

_prims

[BE]: ruff PLC0207 - use maxsplit kwarg (#160107 )

2025-08-08 03:14:59 +00:00

_prims_common

[dynamic shapes] prims_common non_overlapping_and_dense (#160462 )

2025-08-19 01:35:28 +00:00

_refs

Revert "Remove guard_size_oblivious from default contiguity python check, and add aten.sym_is_contiguous. (#159197 )"

2025-08-18 07:22:13 +00:00

_strobelight

[BE][2/16] fix typos in torch/ (torch/_*/) (#156312 )

2025-07-12 05:47:06 +00:00

_subclasses

[rfc] add hint_override kwarg to mark_dynamic (#161007 )

2025-08-21 02:22:52 +00:00

_vendor

…

accelerator

Add unified memory APIs for torch.accelerator (#152932 )

2025-08-08 17:41:22 +00:00

amp

Fix autocast context manager when there is exception (#159565 )

2025-08-01 02:12:24 +00:00

Remove the uncessary empty file (#160728 )

2025-08-19 10:54:08 +00:00

autograd

Add ownership token when needed on GradientEdge (#160098 )

2025-08-12 20:14:18 +00:00

backends

[ROCm] SDPA fix mem fault when dropout is enabled (#154864 )

2025-08-21 14:23:13 +00:00

compiler

[PGO] add extra read/write keys (#160715 )

2025-08-18 01:41:08 +00:00

contrib

…

cpu

Replace _device_t with torch.types.Device in torch/cpu/__init__.py (#161031 )

2025-08-21 00:22:43 +00:00

csrc

Add CUDA 13.0 x86 builds (#160956 )

2025-08-22 11:31:09 +00:00

cuda

Remove the uncessary empty file (#160728 )

2025-08-19 10:54:08 +00:00

distributed

[fr] [xpu] Add FlightRecorder support for ProcessGroupXCCL (#158568 )

2025-08-22 09:03:35 +00:00

distributions

[BE][1/16] fix typos in torch/ (#156311 )

2025-07-09 11:02:22 +00:00

export

[export] Allow tempfile._TemporaryFileWrapper in package_pt2 (#161203 )

2025-08-22 04:10:35 +00:00

fft

[BE][PYFMT] migrate PYFMT for torch/[e-n]*/ to ruff format (#144553 )

2025-06-17 08:18:47 +00:00

func

…

futures

Simplify the base classes of _PyFutureMeta (#157757 )

2025-07-08 15:39:56 +00:00

[rfc] add hint_override kwarg to mark_dynamic (#161007 )

2025-08-21 02:22:52 +00:00

headeronly

[Reland] Migrate ScalarType to headeronly (#159911 )

2025-08-06 07:36:37 +00:00

jit

[4/n] Remove references to TorchScript in PyTorch docs (#158317 )

2025-07-16 20:01:34 +00:00

legacy

…

lib

[2/N] Fix cppcoreguidelines-init-variables suppression (#146237 )

2025-06-19 23:26:42 +00:00

linalg

Fix for ambiguity in linalg.norm()'s ord argument of +2 & -2 (#155148 )

2025-06-04 21:15:20 +00:00

masked

Revert "Remove guard_size_oblivious from default contiguity python check, and add aten.sym_is_contiguous. (#159197 )"

2025-08-18 07:22:13 +00:00

monitor

…

mps

[BE][12/16] fix typos in torch/ (#156602 )

2025-07-02 22:55:29 +00:00

mtia

[Re-land][Inductor] Support native Inductor as backend for MTIA (#159211 )

2025-07-29 17:03:24 +00:00

multiprocessing

Support NUMA Binding for Callable Entrypoints (#160163 )

2025-08-12 20:08:49 +00:00

nativert

[nativert] make runtime const folding aware of run_const_graph (#160760 )

2025-08-21 05:22:03 +00:00

nested

Revert "Remove guard_size_oblivious from default contiguity python check, and add aten.sym_is_contiguous. (#159197 )"

2025-08-18 07:22:13 +00:00

typing debugging.py (#160364 )

2025-08-15 02:09:31 +00:00

numa

[ez] Make NUMA signpost parameters JSON serializable (#160710 )

2025-08-15 16:52:43 +00:00

onnx

Revert "[ONNX] Default to dynamo export (#159646 )"

2025-08-18 21:41:32 +00:00

optim

Optimzie zero_grad description (#161239 )

2025-08-22 06:18:25 +00:00

package

[BE][PYFMT] migrate PYFMT for torch/[p-z]*/ to ruff format (#144552 )

2025-08-07 00:09:56 +00:00

profiler

[BE][PYFMT] migrate PYFMT for torch/[p-z]*/ to ruff format (#144552 )

2025-08-07 00:09:56 +00:00

quantization

[BE][PYFMT] migrate PYFMT for torch/[p-z]*/ to ruff format (#144552 )

2025-08-07 00:09:56 +00:00

signal

[BE][PYFMT] migrate PYFMT for torch/[p-z]*/ to ruff format (#144552 )

2025-08-07 00:09:56 +00:00

sparse

[BE][PYFMT] migrate PYFMT for torch/[p-z]*/ to ruff format (#144552 )

2025-08-07 00:09:56 +00:00

special

[BE][PYFMT] migrate PYFMT for torch/[p-z]*/ to ruff format (#144552 )

2025-08-07 00:09:56 +00:00

testing

[ROCm] SDPA fix mem fault when dropout is enabled (#154864 )

2025-08-21 14:23:13 +00:00

utils

Update checkpoint warning to target PyTorch 2.9 (#160725 )

2025-08-19 15:08:50 +00:00

xpu

[BE][PYFMT] migrate PYFMT for torch/[p-z]*/ to ruff format (#144552 )

2025-08-07 00:09:56 +00:00

__config__.py

…

__future__.py

…

__init__.py

[BE] remove torch deploy - conditionals (#158288 )

2025-07-29 17:40:49 +00:00

_appdirs.py

Fix broken URLs (#152237 )

2025-04-27 09:56:42 +00:00

_classes.py

remove allow-untyped-defs from torch/_classes.py (#157231 )

2025-07-08 00:11:52 +00:00

_compile.py

[precompile] Ensure @disable()-ed function won't trigger recompile from precompile bytecode. (#155363 )

2025-06-10 16:13:38 +00:00

_custom_ops.py

Render Example: and not Example:: in docs (#153978 )

2025-05-21 01:03:26 +00:00

_environment.py

…

_guards.py

Move save guard error throwing to separate phase (#160662 )

2025-08-19 14:46:43 +00:00

_jit_internal.py

[BE][1/16] fix typos in torch/ (#156311 )

2025-07-09 11:02:22 +00:00

_linalg_utils.py

Update is_sparse doc to mention that it is sparse_coo specific (#157378 )

2025-07-09 18:22:14 +00:00

_lobpcg.py

[BE][1/16] fix typos in torch/ (#156311 )

2025-07-09 11:02:22 +00:00

_lowrank.py

[BE][1/16] fix typos in torch/ (#156311 )

2025-07-09 11:02:22 +00:00

_meta_registrations.py

Fix meta function for aten.complex (#160894 )

2025-08-20 16:30:04 +00:00

_namedtensor_internals.py

…

_ops.py

[BE] remove torch deploy - conditionals (#158288 )

2025-07-29 17:40:49 +00:00

_python_dispatcher.py

Typo fixes for "overridden" in comments and function names (#155944 )

2025-06-14 03:37:38 +00:00

_size_docs.py

Render Example: and not Example:: in docs (#153978 )

2025-05-21 01:03:26 +00:00

_sources.py

…

_storage_docs.py

Fix docstring for torch.UntypedStorage.from_file (#155067 )

2025-06-05 14:30:49 +00:00

_streambase.py

…

_tensor_docs.py

Add missing optional for tensor ops (#159028 )

2025-07-25 04:36:55 +00:00

_tensor_str.py

Fix max_width computation in _tensor_str._Formatter (#126859 )

2025-08-01 15:05:41 +00:00

_tensor.py

[MPS] Enable dlpack integration (#158888 )

2025-07-24 18:05:41 +00:00

_thread_safe_fork.py

…

_torch_docs.py

[cuda][cupy] Improve cupy device placement when device is provided with explicit index (#158529 )

2025-08-15 00:27:42 +00:00

_utils_internal.py

Wire in pt2_triton_builds (#159897 )

2025-08-06 07:39:51 +00:00

_utils.py

[BE][1/16] fix typos in torch/ (#156311 )

2025-07-09 11:02:22 +00:00

_VF.py

…

_vmap_internals.py

Fix broken URLs (#152237 )

2025-04-27 09:56:42 +00:00

_weights_only_unpickler.py

added class or module info for functions blocked by weight-only load (#159935 )

2025-08-12 20:52:25 +00:00

CMakeLists.txt

CMake build: preserve PYTHONPATH (#160144 )

2025-08-08 16:03:49 +00:00

custom_class_detail.h

…

custom_class.h

[BE][1/16] fix typos in torch/ (#156311 )

2025-07-09 11:02:22 +00:00

extension.h

…

functional.py

unify broadcast_shapes functions and avoid duplicates (#160251 )

2025-08-16 00:54:32 +00:00

header_only_apis.txt

[Reland] Migrate ScalarType to headeronly (#159911 )

2025-08-06 07:36:37 +00:00

hub.py

Allow torch.hub.load with unauthorized GITHUB_TOKEN (#159896 )

2025-08-14 18:15:49 +00:00

library.h

Using std::make_unique<T>() instead of unique<T>(new T()) (#160723 )

2025-08-19 10:25:47 +00:00

library.py

Add utility to get computed kernel in torch.library (#158393 )

2025-08-13 21:00:59 +00:00

overrides.py

[PT2]: Add Static Dispatch Kernel for wrapped_fbgemm_linear_fp16_weight (#160451 )

2025-08-15 04:06:17 +00:00

py.typed

…

quasirandom.py

…

random.py

Update description for torch.random.fork_rng (#151881 )

2025-04-23 16:59:29 +00:00

return_types.py

…

script.h

…

serialization.py

added class or module info for functions blocked by weight-only load (#159935 )

2025-08-12 20:52:25 +00:00

storage.py

mypy 1.16.0 (#155821 )

2025-06-14 18:18:43 +00:00

torch_version.py

…

types.py

…

version.py.tpl

…