34 Commits

Author SHA1 Message Date
2e0e08588e [BE][PYFMT] migrate PYFMT for torch/[e-n]*/ to ruff format (#144553)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144553
Approved by: https://github.com/ezyang
ghstack dependencies: #144551
2025-06-17 08:18:47 +00:00
c52a002a22 Add getDeviceProperties api to torch mtia device (#153577)
topic: not user facing

Test Plan: Internal benchmark.

Differential Revision: D74256550

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153577
Approved by: https://github.com/nautsimon
2025-05-27 11:55:58 +00:00
861945100e [Kineto] Enable OOM observer (#152160)
Summary:
# Context:
When a memory leak happens, it usually triggers an OOM in later iterations. A snapshot of the full iteration would be huge and hard to interpret.
On the CUDA side, an OOM observer generates a snapshot containing the latest 1,500,000 entries when an OOM happens, for debugging.

In this diff, we implement the same feature on the MTIA side.

Test Plan:
Run this test with the last diff in the stack.
```
buck run @//mode/opt  kineto/libkineto/fb/mtia/integration_tests:mtia_memory_auto_trace_test
```

As shown, the memory_snapshot is generated when an OOM happens.
Log: P1794792326
Snapshot: https://fburl.com/pytorch_memory_visualizer/lx73y6s3 {F1977402355}

Differential Revision: D71993315

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152160
Approved by: https://github.com/sraikund16
2025-04-27 15:56:44 +00:00
213eea216a [MTIA] Add _mtia_maybeExchangeDevice to MTIA module (#149340)
Summary: The FlexAttention path uses `_maybe_exchange_device`, so it will be needed eventually for MTIA as well.

Test Plan: `buck2 test fbcode//mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api -- test_maybe_exchange_device`

Reviewed By: chaos5958

Differential Revision: D70072063

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149340
Approved by: https://github.com/chaos5958
2025-03-18 15:15:12 +00:00
769f19bf95 [MTIA] Add _mtia_exchangeDevice to MTIA module (#149322)
Summary: The FlexAttention path uses `_exchange_device`, so it will be needed eventually for MTIA as well.

Test Plan: `buck2 test fbcode//mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api -- test_exchange_device`

Reviewed By: chaos5958

Differential Revision: D70072059

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149322
Approved by: https://github.com/chaos5958
2025-03-17 19:31:10 +00:00
db4ce78d46 PEP585: More UP006 fixes (#146392)
This should be the final PR before we can enable RUFF UP006.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146392
Approved by: https://github.com/justinchuby, https://github.com/albanD, https://github.com/Skylion007
2025-02-20 06:18:13 +00:00
001ebbf734 [MTIA] (4/n) Implement PyTorch APIs to query/reset device peak memory usage (#146751)
Summary: Public summary (shared with Github): This diff updates the unit test for the PyTorch API "reset_peak_memory_stats".

Test Plan:
```
buck2 test //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api -- -r test_reset_peak_memory_stats
```

https://www.internalfb.com/intern/testinfra/testrun/9007199321947161

Reviewed By: yuhc

Differential Revision: D68989900

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146751
Approved by: https://github.com/nautsimon
2025-02-11 03:51:48 +00:00
4803e20bc7 [S481486] Move MTIA dynamic library loading from __init__.py to a separate module (#145322)
Summary: As titled

Test Plan:
- Passed CI tests

```
buck2 test 'fbcode//mode/opt' fbcode//ai_infra/distributed_ai/pyper_local_run/tests/integration_tests:test_icvr_e2e_gpu -- --exact 'ai_infra/distributed_ai/pyper_local_run/tests/integration_tests:test_icvr_e2e_gpu - test_icvr_e2e_gpu (ai_infra.distributed_ai.pyper_local_run.tests.integration_tests.test_icvr_e2e_gpu.TestIcvrE2EGpu)' --run-disabled
```

https://www.internalfb.com/intern/testinfra/testconsole/testrun/9007199320480497/

Differential Revision: D68463242

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145322
Approved by: https://github.com/yuhc, https://github.com/albanD
2025-01-22 23:39:43 +00:00
057d9aff39 [S481486] [MTIA] Correct mtia.device_count() API (#145338)
Summary:
Prev: Count the number of "general" accelerators

Curr: Count the number of MTIA devices by using the MTIA runtime API

Test Plan:
```
buck test //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api -- -r  test_get_device_count
```

https://www.internalfb.com/intern/testinfra/testrun/8162774572631995

Reviewed By: BoyueZheng

Differential Revision: D68472668

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145338
Approved by: https://github.com/BoyueZheng, https://github.com/egienvalue
2025-01-22 17:45:15 +00:00
805c4b597a PEP585 update - torch/_higher_order_ops torch/_subclasses torch/backends torch/compiler torch/cuda torch/masked torch/mtia torch/nested (#145202)
See #145101 for details.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145202
Approved by: https://github.com/bobrenjc93
2025-01-20 22:37:26 +00:00
dabc9566c4 Revert "(MTIA) Move "empty_cache" API (#143402)"
This reverts commit c7d9f298072a3f59b39517e367c7d3d2ea30e6d9.

Reverted https://github.com/pytorch/pytorch/pull/143402 on behalf of https://github.com/huydhn due to The internal diff D67148738 has been reverted ([comment](https://github.com/pytorch/pytorch/pull/143402#issuecomment-2557982597))
2024-12-21 04:01:23 +00:00
c7d9f29807 (MTIA) Move "empty_cache" API (#143402)
Summary: This diff moves one of memory-related APIs to the consolidated location, which is `mtia/memory.py`.

Test Plan:
```
buck2 test //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api
```

https://www.internalfb.com/intern/testinfra/testrun/13510798943184259

Reviewed By: nautsimon

Differential Revision: D67148738

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143402
Approved by: https://github.com/nautsimon
2024-12-20 17:39:06 +00:00
eb67dd3e2d [3/N][Memory Profiling] Add memory profiling function for MTIA hooks (#142149)
Design Doc: https://fburl.com/gdoc/47zpuweb
Prototyping:  D66469341

In this diff, we implement two new MTIA hooks to start/stop the profiler and export the memory snapshot.

In the next diff, we will integrate the MTIA backend with the profiler Python API.

Differential Revision: [D66823583](https://our.internmc.facebook.com/intern/diff/D66823583/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142149
Approved by: https://github.com/nautsimon
2024-12-18 11:58:23 +00:00
fd65bd755d [BE] replace incorrect .. note:: invocations (#142868)
Something I've noticed is that a lot of the distributed sites don't render on our docs at all, but if they ever do, the notes will render properly now 😛

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142868
Approved by: https://github.com/albanD
2024-12-11 19:58:18 +00:00
92cc345683 Implement "torch.mtia.max_memory_allocated" API (#142406)
Summary: This diff implements the interface of the "torch.mtia.max_memory_allocated" API. The internal implementation will be addressed in a separate diff.

Test Plan:
Passed a local unit test: `buck run //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api`

```
----------------------------------------------------------------------
Ran 15 tests in 16.862s

OK
I1127 11:31:14.613909 2272144 afg_bindings.cpp:943] afg-aten::mul.out-dtype_Float-uqJKuNc0 executable has been unloaded
I1127 11:31:14.615438 2272144 afg_bindings.cpp:943] afg-add-dtype_Float-fa37JncC executable has been unloaded
```

Reviewed By: ttrung149, nautsimon

Differential Revision: D66553954

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142406
Approved by: https://github.com/nautsimon
2024-12-11 03:06:18 +00:00
005c5694eb Refactor "torch.mtia.memory_stats" API (#141723)
Summary:
This diff refactors the code for the "torch.mtia.memory_stats" API to maintain the same file hierarchy as its CUDA counterpart:
- All device memory APIs are now located under ".../mtia/memory.py".
- Device memory APIs can be accessed using either "torch.mtia.XYZ" or "torch.mtia.memory.XYZ".
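
The dual access path described above can be illustrated with a small sketch. It uses stand-in module objects rather than the real `torch.mtia` package; the names `fake_mtia` and the placeholder `memory_stats` body are hypothetical and only show the re-export pattern, not the actual implementation.

```python
import types

# Stand-in for a `memory` submodule; the real torch/mtia/memory.py is not reproduced here.
memory = types.ModuleType("fake_mtia.memory")


def memory_stats(device=None):
    # Hypothetical placeholder that returns an empty stats dictionary.
    return {}


memory.memory_stats = memory_stats

# Stand-in for the package namespace. Re-exporting the function here is what
# a line such as `from .memory import memory_stats` in an __init__.py would do.
fake_mtia = types.ModuleType("fake_mtia")
fake_mtia.memory = memory
fake_mtia.memory_stats = memory.memory_stats
```

Both `fake_mtia.memory_stats` and `fake_mtia.memory.memory_stats` then refer to the same function object, so callers can use either path.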

Test Plan:
Passed a local unit test: `buck run //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api`

```
Ran 14 tests in 16.657s

OK
I1127 11:06:06.505201 2133030 afg_bindings.cpp:943] afg-aten::mul.out-dtype_Float-bBtLGD6Y executable has been unloaded
I1127 11:06:06.506654 2133030 afg_bindings.cpp:943] afg-add-dtype_Float-fa37JncC executable has been unloaded
W1127 11:06:08.731138 2133030 HazptrDomain.h:148] Tagged objects remain. This may indicate a higher-level leak of object(s) that use hazptr_obj_cohort.
```

Differential Revision: D66549179

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141723
Approved by: https://github.com/nautsimon
2024-12-09 19:19:19 +00:00
d70b7029c8 [MTIA] Support torch.mtia.empty_cache() (#141533)
Summary: As title

Test Plan:
Passed a local unit test: `buck2 test //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api`

https://www.internalfb.com/intern/testinfra/testrun/4785074861101240

Reviewed By: nautsimon

Differential Revision: D66481778

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141533
Approved by: https://github.com/nautsimon
2024-11-28 02:24:19 +00:00
12e95aa4ee [BE]: Apply PERF401 autofixes from ruff (#140980)
* Automatically applies ruff rule 401. Turns loops into equivalent list comprehensions which are faster and do not leak the scope of the loop variables.
* list comprehensions not only often have better typing, but are 50+% faster than for loops on overhead. They also preserve length information etc and are better for the interpreter to optimize.
* Manually went back and made mypy happy after the change.
* Also fixed style lints in files covered by flake8 but not by pyfmt
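
As a minimal illustration of the kind of rewrite PERF401 performs (the function names below are hypothetical and not taken from the PyTorch tree), a loop that appends to a list becomes an equivalent list comprehension:

```python
def squares_loop(values):
    # Loop-and-append style that PERF401 flags.
    result = []
    for v in values:
        if v % 2 == 0:
            result.append(v * v)
    return result


def squares_comprehension(values):
    # Same filter-and-map as a list comprehension.
    # The loop variable `v` stays local to the comprehension.
    return [v * v for v in values if v % 2 == 0]
```

Both functions return the same list for the same input; the sketch makes no claim about the size of any speed-up.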

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140980
Approved by: https://github.com/justinchuby, https://github.com/malfet
2024-11-20 17:52:07 +00:00
cc365fdd7b [MTIA] Support torch.cuda.get_device_capability equivalent API on MTIA (#135889)
Summary:
Mirror `get_device_capability` on MTIA per https://fburl.com/gdoc/p4lo5avn

At the moment, both the major and minor version are just 0

Test Plan:
Unit test: `buck2 test //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api`

https://www.internalfb.com/intern/testinfra/testconsole/testrun/1688850109958190/

Differential Revision: D62595296

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135889
Approved by: https://github.com/egienvalue
2024-09-17 17:42:56 +00:00
828ab84e19 Improve error msg on _lazy_init() error (#134159)
Reviewed By: hanzlfs

Differential Revision: D61627609

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134159
Approved by: https://github.com/hanzlfs
2024-08-22 15:10:50 +00:00
758a0a88a2 [BE][Easy] enable ruff rule PIE790: unnecessary pass statement (#133200)
This PR removes unnecessary `pass` statements. This is semantically safe because the bytecode for the Python code does not change.

Note that if there is a docstring in the function, an empty function does not need a `pass` statement as a placeholder.
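
A minimal sketch of what PIE790 flags (function names are hypothetical): a docstring is already a complete function body, so a trailing `pass` adds nothing. The snippet only shows that both forms behave the same when called; it does not itself check the bytecode claim.

```python
def with_pass():
    """Placeholder that still carries a redundant `pass`."""
    pass


def without_pass():
    """Placeholder whose docstring alone forms a valid body."""
```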

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133200
Approved by: https://github.com/malfet, https://github.com/eqy, https://github.com/kit1980
2024-08-15 15:50:19 +00:00
f3fce597e9 [BE][Easy][17/19] enforce style for empty lines in import segments in torch/[a-c]*/ and torch/[e-n]*/ (#129769)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129769
Approved by: https://github.com/ezyang
2024-08-04 10:24:09 +00:00
dcb03106b7 [Land Internally] MTIA equivalent of torch.cuda.memory_stats (#132007)
Summary: as title

Test Plan: pytorch ci failing: https://github.com/pytorch/pytorch/issues/131962

Differential Revision: D60335413

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132007
Approved by: https://github.com/hanzlfs, https://github.com/egienvalue
2024-07-29 20:47:18 +00:00
b343644f3a Revert "MTIA equivalent of torch.cuda.memory_stats (#131673)"
This reverts commit 513ce5f69a7f53742b7aa5798082dd158beec2ed.

Reverted https://github.com/pytorch/pytorch/pull/131673 on behalf of https://github.com/clee2000 due to linked internal diff has internal changes, not sure what happened here, but this shouldn't have been merged externally without also merging the internal diff ([comment](https://github.com/pytorch/pytorch/pull/131673#issuecomment-2251749644))
2024-07-26 00:54:37 +00:00
513ce5f69a MTIA equivalent of torch.cuda.memory_stats (#131673)
Summary: Adding MTIA equivalent of `torch.cuda.memory_stats`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131673
Approved by: https://github.com/egienvalue
2024-07-25 21:59:59 +00:00
abb313b466 [torch.mtia] Noop set_rng_state and get_rng_state APIs (#130873)
Summary: As title

Test Plan: CI tests

Reviewed By: joebos

Differential Revision: D59036602

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130873
Approved by: https://github.com/hanzlfs
2024-07-24 01:52:21 +00:00
1fd7496ab2 [MTIA] Fix synchronize API (#128714)
Reviewed By: fenypatel99

Differential Revision: D58590313

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128714
Approved by: https://github.com/aaronenyeshi
2024-06-17 21:58:46 +00:00
f843ccbb1a [MTIA] Add set_device support (#128040)
Summary: Support set_device API in MTIA backend.

Reviewed By: gnahzg

Differential Revision: D58089498

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128040
Approved by: https://github.com/gnahzg
2024-06-10 23:42:52 +00:00
038b927590 Flip default value for mypy disallow_untyped_defs [7/11] (#127844)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127844
Approved by: https://github.com/oulgen
ghstack dependencies: #127842, #127843
2024-06-08 18:49:45 +00:00
73744a2c00 torch.mtia module for MTIA device backend (#123612)
The MTIA device now has its own module in PyTorch.
torch.mtia provides the following APIs, similar to other backends. Lazy initialization (lazy_init) is also supported.
```
__all__ = [
    "init",
    "is_available",
    "synchronize",
    "device_count",
    "current_device",
    "current_stream",
    "default_stream",
    "set_stream",
    "stream",
    "device",
]

```
------------
For device management, we expand AcceleratorHooksInterface to support generic device management, so that it can be used from both C++ and Python.
```
def _accelerator_hooks_device_count() -> _int: ...
def _accelerator_hooks_set_current_device(device_index: _int) -> None: ...
def _accelerator_hooks_get_current_device() -> _int: ...
def _accelerator_hooks_exchange_device(device_index: _int) -> _int: ...
def _accelerator_hooks_maybe_exchange_device(device_index: _int) -> _int: ...
```
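
To make the role of these hooks concrete, here is a toy in-memory sketch. It is not the PyTorch implementation: the `ToyDeviceHooks` class and `device_guard` helper are hypothetical, and the sketch assumes that `exchange_device` switches to the requested index and returns the previously current one. `maybe_exchange_device` is left out because its exact conditions are not described in this commit message.

```python
from contextlib import contextmanager


class ToyDeviceHooks:
    """In-memory stand-in for a device backend; not the PyTorch implementation."""

    def __init__(self, count):
        self._count = count
        self._current = 0

    def device_count(self):
        return self._count

    def get_current_device(self):
        return self._current

    def set_current_device(self, device_index):
        if not 0 <= device_index < self._count:
            raise ValueError(f"invalid device index: {device_index}")
        self._current = device_index

    def exchange_device(self, device_index):
        # Assumed semantics: switch to `device_index`, return the previous index.
        previous = self._current
        self.set_current_device(device_index)
        return previous


@contextmanager
def device_guard(hooks, device_index):
    # Switch devices for the duration of the block, then restore the old one.
    previous = hooks.exchange_device(device_index)
    try:
        yield
    finally:
        hooks.set_current_device(previous)
```

Under these assumptions, `with device_guard(hooks, 1): ...` runs its body with device 1 current and restores the earlier device afterwards, which is the typical reason an exchange-style call returns the old index.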

---------
Adding get_device_module API to retrieve device modules for different device types.
```
def get_device_module(device: Optional[Union[torch.device, str]] = None)
```
---------

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123612
Approved by: https://github.com/albanD
ghstack dependencies: #123611
2024-04-26 16:17:54 +00:00
e04c7b19f4 Revert "torch.mtia module for MTIA device backend (#123612)"
This reverts commit 381653de63df4b1b31cc95531320caf83b1b60b3.

Reverted https://github.com/pytorch/pytorch/pull/123612 on behalf of https://github.com/jeffdaily due to this PR broke ROCm with message RuntimeError: Cannot have MTIA with other devices ([comment](https://github.com/pytorch/pytorch/pull/123612#issuecomment-2077649762))
2024-04-25 16:06:46 +00:00
381653de63 torch.mtia module for MTIA device backend (#123612)
The MTIA device now has its own module in PyTorch.
torch.mtia provides the following APIs, similar to other backends. Lazy initialization (lazy_init) is also supported.
```
__all__ = [
    "init",
    "is_available",
    "synchronize",
    "device_count",
    "current_device",
    "current_stream",
    "default_stream",
    "set_stream",
    "stream",
    "device",
]

```
------------
For device management, we expand AcceleratorHooksInterface to support generic device management, so that it can be used from both C++ and Python.
```
def _accelerator_hooks_device_count() -> _int: ...
def _accelerator_hooks_set_current_device(device_index: _int) -> None: ...
def _accelerator_hooks_get_current_device() -> _int: ...
def _accelerator_hooks_exchange_device(device_index: _int) -> _int: ...
def _accelerator_hooks_maybe_exchange_device(device_index: _int) -> _int: ...
```

---------
Adding get_device_module API to retrieve device modules for different device types.
```
def get_device_module(device: Optional[Union[torch.device, str]] = None)
```
---------

Differential Revision: [D56443356](https://our.internmc.facebook.com/intern/diff/D56443356)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123612
Approved by: https://github.com/albanD
ghstack dependencies: #123611
2024-04-24 20:51:20 +00:00
929242a15c Revert "torch.mtia module for MTIA device backend (#123612)"
This reverts commit d7e1bf9ff908d2a9c20d5354426d34c539fcb7a1.

Reverted https://github.com/pytorch/pytorch/pull/123612 on behalf of https://github.com/jeffdaily due to This broke ROCm. see test_overrides.py ([comment](https://github.com/pytorch/pytorch/pull/123611#issuecomment-2067363780))
2024-04-19 22:44:26 +00:00
d7e1bf9ff9 torch.mtia module for MTIA device backend (#123612)
The MTIA device now has its own module in PyTorch.
torch.mtia provides the following APIs, similar to other backends. Lazy initialization (lazy_init) is also supported.
```
__all__ = [
    "init",
    "is_available",
    "synchronize",
    "device_count",
    "current_device",
    "current_stream",
    "default_stream",
    "set_stream",
    "stream",
    "device",
]

```
------------
For device management, we expand AcceleratorHooksInterface to support generic device management, so that it can be used from both C++ and Python.
```
def _accelerator_hooks_device_count() -> _int: ...
def _accelerator_hooks_set_current_device(device_index: _int) -> None: ...
def _accelerator_hooks_get_current_device() -> _int: ...
def _accelerator_hooks_exchange_device(device_index: _int) -> _int: ...
def _accelerator_hooks_maybe_exchange_device(device_index: _int) -> _int: ...
```

---------
Adding get_device_module API to retrieve device modules for different device types.
```
def get_device_module(device: Optional[Union[torch.device, str]] = None)
```
---------
@exported-using-ghexport

Differential Revision: [D52923602](https://our.internmc.facebook.com/intern/diff/D52923602/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123612
Approved by: https://github.com/albanD
ghstack dependencies: #123611
2024-04-18 17:38:06 +00:00