f76f4abf3f
Track monitor ( #156907 )
...
Track GPU memory allocation. We were previously tracking GPU memory bandwidth utilization, but memory allocation is the metric that reflects whether the GPU is out of memory (OOM). A UI fix is upcoming.
UI fix: https://github.com/pytorch/test-infra/pull/6878/files
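The difference between the two GPU memory metrics named in this message can be sketched as follows. This is a minimal illustration, not the monitor script's actual code: the function names `mem_allocation_percent` and `read_gpu_memory_stats` are hypothetical, and the second helper assumes the third-party `pynvml` package and an NVIDIA GPU are available.

```python
from typing import Dict


def mem_allocation_percent(used_bytes: int, total_bytes: int) -> float:
    """Share of device memory currently allocated, as a percentage."""
    if total_bytes <= 0:
        return 0.0
    return 100.0 * used_bytes / total_bytes


def read_gpu_memory_stats(index: int = 0) -> Dict[str, float]:
    """Hypothetical helper: read both memory metrics for one NVIDIA GPU."""
    # Imported lazily so mem_allocation_percent works without pynvml installed.
    import pynvml

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)  # total / used / free, in bytes
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # gpu / memory, in percent
        return {
            # How full the device memory is: the OOM-relevant number.
            "mem_allocation_percent": mem_allocation_percent(mem.used, mem.total),
            # Share of time device memory was being read or written.
            "mem_bandwidth_percent": float(util.memory),
        }
    finally:
        pynvml.nvmlShutdown()
```

A GPU can show low bandwidth utilization while its memory is almost fully allocated, which is why only the allocation figure indicates how close a job is to OOM.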
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156907
Approved by: https://github.com/huydhn
2025-07-18 22:54:13 +00:00
a69785b3ec
[BE] fix typos in tools/ ( #156082 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156082
Approved by: https://github.com/soulitzer
ghstack dependencies: #156079
2025-06-17 19:25:50 +00:00
335c89c6f1
[Monitoring] enable local logs and add mac test monitoring ( #153454 )
...
Enable running the upload utilization logic using a local pointer instead of reading from S3. This could be useful for ROCm too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153454
Approved by: https://github.com/huydhn
2025-05-20 17:14:40 +00:00
c54b9f2969
[Monitoring] Add util for linux build ( #153456 )
...
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153456
Approved by: https://github.com/huydhn
2025-05-19 17:28:17 +00:00
a88d7d4268
[util] fetch logical count cpu ( #147413 )
...
To match the vCPU count reported by AWS:
after (96), before (48)
Instance Ref: https://instances.vantage.sh/aws/ec2/g4dn.metal
before: https://hud.pytorch.org/utilization/13377376406/37360984234/1
after: https://hud.pytorch.org/utilization/13401543806/37435031356/1
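As a rough illustration of the logical versus physical distinction behind the 96 and 48 figures above, the sketch below reads the logical CPU count using only the standard library. The helper name `logical_cpu_count` is hypothetical. The equivalent `psutil` calls are mentioned in a comment as an assumption about how a monitor script might do it, not as what this commit does.

```python
import os


def logical_cpu_count() -> int:
    """Return the number of logical CPUs (hardware threads) visible to the OS."""
    # os.cpu_count() reports logical CPUs and may return None if undetermined.
    count = os.cpu_count()
    return count if count is not None else 1


# With psutil installed, the same distinction is available as
# psutil.cpu_count(logical=True) and psutil.cpu_count(logical=False).
print(f"logical CPUs: {logical_cpu_count()}")
```

On a machine with two hardware threads per core, the logical count is twice the physical core count, which is consistent with the change from 48 to 96 on the referenced instance type.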
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147413
Approved by: https://github.com/clee2000
2025-02-19 23:44:54 +00:00
b0553cee6b
[Utilization] post-test-process workflow ( #145310 )
...
# Overview
Add reusable workflow to trigger the post-test right after each test job is complete.
Companion PRs to set up the runner permissions:
Add m fleet instances: https://github.com/pytorch-labs/pytorch-gha-infra/pull/595/files
Add to lix fleet: https://github.com/pytorch/ci-infra/pull/322/files
Currently I turn on the debug flag for testing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145310
Approved by: https://github.com/huydhn
2025-02-13 18:51:19 +00:00
fd73ae2068
[Utilization] Convert timestamp to str for datetime64 ( #145985 )
...
Convert all float timestamps to int timestamps in the data pipeline for the DB type datetime64.
Float values do not work when inserting into ClickHouse using jsonExtract.
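A minimal sketch of the float-to-int conversion described above, assuming records are plain dictionaries. The function name `normalize_timestamps` and the `timestamp` key are illustrative and not taken from the actual pipeline.

```python
from typing import Any, Dict, Iterable


def normalize_timestamps(
    record: Dict[str, Any], keys: Iterable[str] = ("timestamp",)
) -> Dict[str, Any]:
    """Return a copy of the record with float timestamps truncated to ints."""
    out = dict(record)
    for key in keys:
        value = out.get(key)
        if isinstance(value, float):
            # Drop the fractional seconds; other fields are left untouched.
            out[key] = int(value)
    return out


raw = {"timestamp": 1738616718.75, "cpu": 12.5}
print(normalize_timestamps(raw))
```

Only the listed keys are converted, so float-valued metrics such as `cpu` keep their precision.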
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145985
Approved by: https://github.com/huydhn
2025-02-03 21:05:18 +00:00
a9ed7bd78e
[utilization] pipeline to create clean db records ( #145327 )
...
Add upload_utilization_script to generate DB-insert-ready records and write them to S3:
- generate two files, metadata and timeseries, in the ossci-utilization buckets
- convert log records to the DB format
- add a unit test job for tools/stats/
Related PRs:
setup composite action for data pipeline: https://github.com/pytorch/pytorch/pull/145310
add permission for composite action to access S3 bucket: https://github.com/pytorch-labs/pytorch-gha-infra/pull/595
add insert logic in s3 replicator: https://github.com/pytorch/test-infra/pull/6217
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145327
Approved by: https://github.com/huydhn
Co-authored-by: Huy Do <huydhn@gmail.com>
2025-01-29 23:48:50 +00:00
6d4f5f7688
[Utilization][Usage Log] Add data model for record ( #145114 )
...
Add a data model for usage log records, both for consistency and to make future data model changes easier.
The data model will be used in the post-test-process pipeline.
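To illustrate what a shared data model for log records can look like, here is a small sketch using a dataclass. The class name `UsageRecord` and every field in it are hypothetical and chosen only for illustration. They are not the fields defined in this commit.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, Optional


@dataclass
class UsageRecord:
    """Hypothetical record shape shared by the log writer and the pipeline."""

    timestamp: int
    level: str
    data: Dict[str, float] = field(default_factory=dict)
    error: Optional[str] = None

    def to_json(self) -> str:
        # Serialize with stable key order so log lines are easy to diff.
        return json.dumps(asdict(self), sort_keys=True)


record = UsageRecord(timestamp=1, level="record", data={"cpu_avg": 2.0})
print(record.to_json())
```

Defining the shape in one place means the writer and the reader fail in the same spot when a field changes, instead of silently disagreeing.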
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145114
Approved by: https://github.com/huydhn
2025-01-23 19:04:41 +00:00
fea9d18d5a
[Utilization Log] Concurrently collect aggregate data during the output interval ( #143235 )
...
# Overview
Add a worker to collect metrics at short intervals.
1. Worker: add a worker that collects usage metrics every 500 ms by default; the interval is configurable.
2. Calculate the avg and max as a data point every 5 seconds by default.
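The two steps above can be sketched as a background sampling thread plus a periodic summary. This is an illustration under stated assumptions, not the script's implementation: the names `SamplingWorker`, `summarize`, `read_metric`, and `sample_interval` are hypothetical.

```python
import threading
from typing import Callable, Dict, List, Optional


def summarize(samples: List[float]) -> Optional[Dict[str, float]]:
    """Reduce the samples of one output interval to an avg/max data point."""
    if not samples:
        return None
    return {"avg": sum(samples) / len(samples), "max": max(samples)}


class SamplingWorker:
    """Collects one metric on a background thread at a short, fixed interval."""

    def __init__(self, read_metric: Callable[[], float], sample_interval: float = 0.5):
        self._read_metric = read_metric
        self._sample_interval = sample_interval
        self._samples = []
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()

    def _run(self) -> None:
        while not self._stop.is_set():
            value = self._read_metric()
            with self._lock:
                self._samples.append(value)
            # wait() doubles as an interruptible sleep between samples.
            self._stop.wait(self._sample_interval)

    def flush(self) -> Optional[Dict[str, float]]:
        """Swap out the buffered samples and return their avg/max summary."""
        with self._lock:
            samples, self._samples = self._samples, []
        return summarize(samples)
```

A caller would start the worker, then call `flush()` once per output interval (every 5 seconds with the defaults described above) and log the returned data point.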
# Other
Clean up the log format to keep only what is needed. Currently we do not need to track GPU processes etc., or all PIDs from psutil.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143235
Approved by: https://github.com/huydhn
2025-01-16 23:52:43 +00:00
498a7808ff
Fix unused Python variables outside torch/ and test/ ( #136359 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136359
Approved by: https://github.com/albanD
2024-12-11 17:10:23 +00:00
b7a45dbae3
Add monitor script ( #141438 )
...
# Overview
Add monitor script to collect system-level utilization data during CI tests.
Currently all monitoring scripts are disabled.
# Details
- Add flag to customize the time intervals for logging
- Enable multiple GPU utilization logging
# Next step
Enable the monitor script in non-perf-test workflows.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141438
Approved by: https://github.com/huydhn
2024-11-29 04:14:31 +00:00
1266be21f4
deprecated datetime.utcnow() fix and _RendezvousJoinOp module initiation bug fix ( #136141 )
...
Fix to #136140
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136141
Approved by: https://github.com/kwen2501
2024-09-24 07:26:10 +00:00
8a67daf283
[BE][Easy] enable postponed annotations in tools ( #129375 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129375
Approved by: https://github.com/malfet
2024-06-29 09:23:35 +00:00
a32ce5ce34
Revert "[BE][Easy] enable postponed annotations in tools ( #129375 )"
...
This reverts commit 59eb2897f1745f513edb6c63065ffad481c4c8d0.
Reverted https://github.com/pytorch/pytorch/pull/129375 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I need to revert to cleanly revert https://github.com/pytorch/pytorch/pull/129374 , please do a rebase and reland this ([comment](https://github.com/pytorch/pytorch/pull/129375#issuecomment-2197800541 ))
2024-06-29 00:44:25 +00:00
59eb2897f1
[BE][Easy] enable postponed annotations in tools ( #129375 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129375
Approved by: https://github.com/malfet
2024-06-28 15:37:54 +00:00
06934518a2
[AMD] Fix deprecated amdsmi api ( #126962 )
...
Summary: https://github.com/pytorch/pytorch/pull/119182 uses an API that has already been deprecated by c551c3caed. Fix this in a backward-compatible way.
Differential Revision: D57711088
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126962
Approved by: https://github.com/eqy , https://github.com/izaitsevfb
2024-05-26 20:11:23 +00:00
d30cdc4321
[ROCm] amdsmi library integration ( #119182 )
...
Adds monitoring support for ROCm using amdsmi in place of pynvml.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119182
Approved by: https://github.com/jeffdaily , https://github.com/malfet , https://github.com/xw285cornell
2024-05-21 01:59:26 +00:00
0d4fdb0bb7
Revert "[ROCm] amdsmi library integration ( #119182 )"
...
This reverts commit 85447c41e32b1e43a025ea19ac812a0c7f88ff57.
Reverted https://github.com/pytorch/pytorch/pull/119182 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the ROCm failed test is legit 85447c41e3 ([comment](https://github.com/pytorch/pytorch/pull/119182#issuecomment-2103433197 ))
2024-05-09 21:18:21 +00:00
85447c41e3
[ROCm] amdsmi library integration ( #119182 )
...
Adds monitoring support for ROCm using amdsmi in place of pynvml.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119182
Approved by: https://github.com/jeffdaily , https://github.com/malfet , https://github.com/xw285cornell
2024-05-09 18:21:38 +00:00
60a68477a6
Bump black version to 23.1.0 ( #96578 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96578
Approved by: https://github.com/ezyang
2023-03-15 06:27:59 +00:00
f11dc26ed5
[ROCm] tools/stats/monitor.py support ( #91732 )
...
Initial support for rocm-smi monitoring of GPU utilization. Works around difficulties of using the rocm-smi python bindings without having an explicit package.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91732
Approved by: https://github.com/huydhn , https://github.com/pruthvistony
2023-01-05 18:34:11 +00:00
7c6fe21a38
Fix monitoring script for macos ( #88159 )
...
The monitoring script is currently failing with AccessDenied when trying to access uss memory on mac because [psutil.memory_full_info](https://psutil.readthedocs.io/en/latest/index.html?highlight=memory_full_info ) requires higher user privileges
Example failures:
* https://gha-artifacts.s3.amazonaws.com/pytorch/pytorch/3363066309/1/artifact/usage-log-test-default-2-2-macos-12_9208104847.zip
* https://gha-artifacts.s3.amazonaws.com/pytorch/pytorch/3363066309/1/artifact/usage-log-test-default-2-2-macos-m1-12_9207913759.zip
I could also make this script run with sudo, effectively granting this permission. But I'm not entirely sure that we need uss memory for mac, so gracefully handling the error looks nicer
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88159
Approved by: https://github.com/clee2000
2022-11-01 05:58:44 +00:00
795906f207
Add total GPU memory utilization ( #86250 )
...
Although we already have per process GPU memory usage, I'm curious to see what is the number for `gpu_utilization.memory` per https://docs.nvidia.com/deploy/nvml-api/structnvmlUtilization__t.html . Also fixing a tiny typo issue that has been bugging me for a while `total_gpu_utilizaiton`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86250
Approved by: https://github.com/ZainRizvi
2022-10-06 18:53:59 +00:00
6f2a88dd50
script to monitor memory + cpu utilization ( #82006 )
...
Add a python script that runs in the background during test jobs to log cpu + gpu memory usage and cpu utilization of python tests (really any python process) to a file and upload the file as an artifact.
I plan on using the GPU memory usage stats to better understand how to parallelize the tests, but it is easy to add other stats if people want them.
In the future, we want to add the ability to track network usage to see if we can decrease it. GPU utilization will also likely need to be improved.
Click the hud link to see uploaded usage log artifacts
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82006
Approved by: https://github.com/huydhn
2022-07-25 16:53:31 +00:00