pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-21 05:34:18 +08:00

Author	SHA1	Message	Date
Yang Wang	f76f4abf3f	Track monitor (#156907 ) Track GPU memory allocation. We were previously tracking GPU memory bandwidth, but memory allocation is what reflects whether the GPU is OOM. The UI fix is upcoming: https://github.com/pytorch/test-infra/pull/6878/files Pull Request resolved: https://github.com/pytorch/pytorch/pull/156907 Approved by: https://github.com/huydhn	2025-07-18 22:54:13 +00:00
Xuehai Pan	a69785b3ec	[BE] fix typos in tools/ (#156082 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156082 Approved by: https://github.com/soulitzer ghstack dependencies: #156079	2025-06-17 19:25:50 +00:00
Yang Wang	335c89c6f1	[Monitoring] enable local logs and add mac test monitoring (#153454 ) Enable running the upload utilization logic from a local pointer instead of reading from S3; this could be useful for ROCm too. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153454 Approved by: https://github.com/huydhn	2025-05-20 17:14:40 +00:00
Yang Wang	c54b9f2969	[Monitoring] Add util for linux build (#153456 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/153456 Approved by: https://github.com/huydhn	2025-05-19 17:28:17 +00:00
Yang Wang	a88d7d4268	[util] fetch logical count cpu (#147413 ) Fetch the logical CPU count so that it matches the vCPU count reported by AWS: after (96), before (48). Instance ref: https://instances.vantage.sh/aws/ec2/g4dn.metal Before: https://hud.pytorch.org/utilization/13377376406/37360984234/1 After: https://hud.pytorch.org/utilization/13401543806/37435031356/1 Pull Request resolved: https://github.com/pytorch/pytorch/pull/147413 Approved by: https://github.com/clee2000	2025-02-19 23:44:54 +00:00
Yang Wang	b0553cee6b	[Utilization] post-test-process workflow (#145310 ) # Overview Add a reusable workflow that triggers the post-test process right after each test job completes. Related PRs to set up the runner permissions: add to m fleet instances: https://github.com/pytorch-labs/pytorch-gha-infra/pull/595/files add to lix fleet: https://github.com/pytorch/ci-infra/pull/322/files The debug flag is currently turned on for testing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145310 Approved by: https://github.com/huydhn	2025-02-13 18:51:19 +00:00
Yang Wang	fd73ae2068	[Utilization] Convert timestamp to str for datetime64 (#145985 ) Convert all float timestamps to int timestamps in the data pipeline for the DB type datetime64. Floats do not work when inserting into ClickHouse using jsonExtract. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145985 Approved by: https://github.com/huydhn	2025-02-03 21:05:18 +00:00
Yang Wang	a9ed7bd78e	[utilization] pipeline to create clean db records (#145327 ) Add upload_utilization_script to generate DB-ready insert records and upload them to S3: - generate two files, metadata and timeseries, in the ossci-utilization buckets - convert log records to DB-format records - add a unit test job for tools/stats/ Related PRs: set up composite action for data pipeline: https://github.com/pytorch/pytorch/pull/145310 add permission for composite action to access S3 bucket: https://github.com/pytorch-labs/pytorch-gha-infra/pull/595 add insert logic in s3 replicator: https://github.com/pytorch/test-infra/pull/6217 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145327 Approved by: https://github.com/huydhn Co-authored-by: Huy Do <huydhn@gmail.com>	2025-01-29 23:48:50 +00:00
Yang Wang	6d4f5f7688	[Utilization][Usage Log] Add data model for record (#145114 ) Add a data model for consistency and to ease future data model changes. The data model will be used in the post-test-process pipeline. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145114 Approved by: https://github.com/huydhn	2025-01-23 19:04:41 +00:00
Yang Wang	fea9d18d5a	[Utilization Log] Concurrently collect aggregate data during the output interval (#143235 ) # Overview Add a worker to collect metrics at short intervals. 1. Worker: add a worker that collects usage metrics every 500 ms by default (configurable). 2. Calculate the avg and max as a data point every 5 seconds by default. # Other Clean up the log format to keep only what is needed; currently we do not need to track GPU processors etc., or all PIDs from psutil. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143235 Approved by: https://github.com/huydhn	2025-01-16 23:52:43 +00:00
Tom Ritchford	498a7808ff	Fix unused Python variables outside torch/ and test/ (#136359 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136359 Approved by: https://github.com/albanD	2024-12-11 17:10:23 +00:00
Yang Wang	b7a45dbae3	Add monitor script (#141438 ) # Overview Add monitor script to collect system-level utilization data during CI tests. Currently all monitoring scripts are disabled. # Details - Add flag to customize the time intervals for logging - Enable multiple GPU utilization logging # Next step Enable the monitor script in non-perf-test workflows. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141438 Approved by: https://github.com/huydhn	2024-11-29 04:14:31 +00:00
Amin Alam	1266be21f4	deprecated datetime.utcnow() fix and _RendezvousJoinOp module initiation bug fix (#136141 ) Fix to #136140 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136141 Approved by: https://github.com/kwen2501	2024-09-24 07:26:10 +00:00
Xuehai Pan	8a67daf283	[BE][Easy] enable postponed annotations in `tools` (#129375 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129375 Approved by: https://github.com/malfet	2024-06-29 09:23:35 +00:00
PyTorch MergeBot	a32ce5ce34	Revert "[BE][Easy] enable postponed annotations in `tools` (#129375 )" This reverts commit 59eb2897f1745f513edb6c63065ffad481c4c8d0. Reverted https://github.com/pytorch/pytorch/pull/129375 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I need to revert to cleanly revert https://github.com/pytorch/pytorch/pull/129374, please do a rebase and reland this ([comment](https://github.com/pytorch/pytorch/pull/129375#issuecomment-2197800541))	2024-06-29 00:44:25 +00:00
Xuehai Pan	59eb2897f1	[BE][Easy] enable postponed annotations in `tools` (#129375 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129375 Approved by: https://github.com/malfet	2024-06-28 15:37:54 +00:00
Xiaodong Wang	06934518a2	[AMD] Fix deprecated amdsmi api (#126962 ) Summary: https://github.com/pytorch/pytorch/pull/119182 uses an API that has already been deprecated by `c551c3caed`. So fixing this in a backward compatible way Differential Revision: D57711088 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126962 Approved by: https://github.com/eqy, https://github.com/izaitsevfb	2024-05-26 20:11:23 +00:00
Jack Taylor	d30cdc4321	[ROCm] amdsmi library integration (#119182 ) Adds monitoring support for ROCm using amdsmi in place of pynvml. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119182 Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/xw285cornell	2024-05-21 01:59:26 +00:00
PyTorch MergeBot	0d4fdb0bb7	Revert "[ROCm] amdsmi library integration (#119182 )" This reverts commit 85447c41e32b1e43a025ea19ac812a0c7f88ff57. Reverted https://github.com/pytorch/pytorch/pull/119182 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the ROCm failed test is legit `85447c41e3` ([comment](https://github.com/pytorch/pytorch/pull/119182#issuecomment-2103433197))	2024-05-09 21:18:21 +00:00
Jack Taylor	85447c41e3	[ROCm] amdsmi library integration (#119182 ) Adds monitoring support for ROCm using amdsmi in place of pynvml. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119182 Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/xw285cornell	2024-05-09 18:21:38 +00:00
BowenBao	60a68477a6	Bump black version to 23.1.0 (#96578 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/96578 Approved by: https://github.com/ezyang	2023-03-15 06:27:59 +00:00
Jeff Daily	f11dc26ed5	[ROCm] tools/stats/monitor.py support (#91732 ) Initial support for rocm-smi monitoring of GPU utilization. Works around difficulties of using the rocm-smi python bindings without having an explicit package. Pull Request resolved: https://github.com/pytorch/pytorch/pull/91732 Approved by: https://github.com/huydhn, https://github.com/pruthvistony	2023-01-05 18:34:11 +00:00
Huy Do	7c6fe21a38	Fix monitoring script for macos (#88159 ) The monitoring script is currently failing with AccessDenied when trying to access uss memory on mac because [psutil.memory_full_info](https://psutil.readthedocs.io/en/latest/index.html?highlight=memory_full_info) requires higher user privileges Example failures: * https://gha-artifacts.s3.amazonaws.com/pytorch/pytorch/3363066309/1/artifact/usage-log-test-default-2-2-macos-12_9208104847.zip * https://gha-artifacts.s3.amazonaws.com/pytorch/pytorch/3363066309/1/artifact/usage-log-test-default-2-2-macos-m1-12_9207913759.zip I could also make this script run with sudo, effectively granting this permission. But I'm not entirely sure that we need uss memory for mac, so gracefully handling the error looks nicer Pull Request resolved: https://github.com/pytorch/pytorch/pull/88159 Approved by: https://github.com/clee2000	2022-11-01 05:58:44 +00:00
Huy Do	795906f207	Add total GPU memory utilization (#86250 ) Although we already have per process GPU memory usage, I'm curious to see what is the number for `gpu_utilization.memory` per https://docs.nvidia.com/deploy/nvml-api/structnvmlUtilization__t.html. Also fixing a tiny typo issue that has been bugging me for a while `total_gpu_utilizaiton` Pull Request resolved: https://github.com/pytorch/pytorch/pull/86250 Approved by: https://github.com/ZainRizvi	2022-10-06 18:53:59 +00:00
Catherine Lee	6f2a88dd50	script to monitor memory + cpu utilization (#82006 ) Add a Python script that runs in the background during test jobs to log CPU + GPU memory usage and CPU utilization of Python tests (really any Python process) to a file and upload the file as an artifact. I plan on using the GPU memory usage stats to better understand how to parallelize the tests, but it is easy to add other stats if people want them. In the future, we want to add the ability to track network usage to see if we can decrease it. GPU utilization will also likely need to be improved. Click the hud link to see uploaded usage log artifacts. Pull Request resolved: https://github.com/pytorch/pytorch/pull/82006 Approved by: https://github.com/huydhn	2022-07-25 16:53:31 +00:00

25 Commits